Microsoft Analytics Platform
System (APS)
Modern Data Warehousing
James Serra
Big Data Evangelist
Microsoft
Agenda
• Traditional data warehouse & modern data warehouse
• APS architecture
• Hadoop & PolyBase
• Performance and scale
• Appliance benefits
• Summarize/questions
Data sources: relational and non-relational
Will your current solution handle future needs?
• Are you using, or going to use, “Big Data” and/or Hadoop?
• No or limited access to detailed data; can only surface reports and cannot ask ad-hoc questions
• Slow data loading performance cannot keep up with the need for data from transactional systems for intraday reporting
• MOLAP cube processing and data refresh take too long
• Slow query performance with need for constant tuning, especially with SAN storage
• High cost of SAN storage chargeback
Roadblocks to evolving to a modern data warehouse (each solution and its issue):
• Keep legacy investment → limited scalability & ability to handle new data types
• Buy new tier-one hardware appliance → high acquisition/migration costs & no Hadoop
• Acquire big data solution (Hadoop) → significant training & still siloed
• Acquire business intelligence solution → complex with low adoption
Introducing the Microsoft Analytics Platform System
Your turnkey modern data warehouse appliance
• Relational and non-relational data in a single appliance
• Or, integrate relational data with non-relational data in an external Hadoop cluster on-premises, or with data stored in the cloud (hot, warm, cold)
• Enterprise-ready Hadoop
• Integrated querying across Hadoop and APS using T-SQL (PolyBase)
• Direct integration with Microsoft BI tools such as Power BI
• Near real-time performance with In-Memory
• Scale out to accommodate your growing data or to increase performance (2 nodes to 56 nodes)
• Remove SMP DW bottlenecks with MPP SQL Server
• No rip and replace when more performance is needed
• No performance tuning required
• Concurrency that fuels rapid adoption
• Industry’s lowest DW price/TB
• Value through a single-appliance solution
• Value with flexible hardware options using commodity hardware
• Free up space on SAN (cost averages $10K per TB)
Hardware appliance vendor offerings
Hardware and software engineered together: the ease of an appliance
• Co-engineered with HP, Dell, and Quanta best practices
• Leading performance with commodity hardware
• Pre-configured, built, and tuned software and hardware
• Integrated support plan with a single Microsoft contact
[Diagram: appliance components: PDW, HDInsight, PolyBase]
APS History
• DatAllegro started in 2003
• Microsoft acquired DatAllegro in September 2008
• PDW released in December 2010 (version 1)
• Version 2 made available in March 2013 (PolyBase introduced)
• AU1 released in April 2014; renamed from Parallel Data Warehouse (PDW) to Analytics Platform System (APS). It still includes the PDW region as well as a new HDInsight/Hadoop region
• AU2 released in July 2014
• AU3 released in October 2014
There will be AU updates every 3-4 months.
NOTE: This is a data warehouse solution, not an OLTP (online transaction processing) solution.
Case studies: go to https://customers.microsoft.com and enter "parallel data warehouse" (old name) in the keyword box and search the results, then enter "analytics platform system" (new name).
Parallelism
MPP (Massively Parallel Processing):
• Uses many separate CPUs running in parallel to execute a single program
• Shared nothing: each CPU has its own memory and disk (scale-out)
• Segments communicate using a high-speed network between nodes
SMP (Symmetric Multiprocessing):
• Multiple CPUs used to complete individual processes simultaneously
• All CPUs share the same memory, disks, and network controllers (scale-up)
• All SQL Server implementations up until now have been SMP
• Mostly, the solution is housed on a shared SAN
APS Logical Architecture (overview)
[Diagram: a “Control” node plus four “Compute” nodes, each running SQL over balanced storage with its own DMS instance]
Compute Node – the “worker bee” of APS
• Runs SQL Server 2014 APS
• Contains a “slice” of each database
• CPU is saturated by storage
Control Node – the “brains” of the APS
• Also runs SQL Server 2014 APS
• Holds a “shell” copy of each database
• Metadata, statistics, etc
• The “public face” of the appliance
Data Movement Services (DMS)
• Part of the “secret sauce” of APS
• Moves data around as needed
• Enables parallel operations among the compute
nodes (queries, loads, etc)
APS Logical Architecture (overview)
[Diagram: a user connects to the “Control” node, which coordinates four “Compute” nodes, each running SQL over balanced storage with its own DMS instance]
1) User connects to the appliance (control node)
and submits query
2) Control node query processor determines best
*parallel* query plan
3) DMS distributes sub-queries to each compute
node
4) Each compute node executes query on its
subset of data
5) Each compute node returns a subset of the
response to the control node
6) If necessary, control node does any final
aggregation/computation
7) Control node returns results to user
Queries run in parallel, each against a subset of the data, using separate pipes, effectively making the overall pipe larger.
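The seven-step scatter-gather flow above can be sketched in a few lines of Python. This is an illustrative simulation, not APS code: each simulated "compute node" aggregates its own slice, and the "control node" performs the final aggregation.

```python
# Illustrative simulation of the MPP scatter-gather query flow (not APS code).

def hash_distribute(rows, num_nodes, key):
    """Rows are spread across compute nodes by hashing the distribution key."""
    slices = [[] for _ in range(num_nodes)]
    for row in rows:
        slices[hash(row[key]) % num_nodes].append(row)
    return slices

def compute_node_query(node_rows):
    """Steps 3-4: each compute node runs the sub-query on its subset of data."""
    return sum(row["qty"] for row in node_rows)

def control_node_query(rows, num_nodes):
    """Steps 1-7: distribute sub-queries, gather partial results, finish up."""
    slices = hash_distribute(rows, num_nodes, "product")
    partials = [compute_node_query(s) for s in slices]  # runs in parallel on APS
    return sum(partials)                                # final aggregation

sales = [{"product": p, "qty": q} for p, q in
         [("a", 2), ("b", 5), ("c", 1), ("a", 7), ("b", 3)]]
print(control_node_query(sales, num_nodes=4))  # prints 18, same as a single-node sum
```

However the rows land on the nodes, the partial sums always combine to the same total, which is why the user never needs to know how the data is distributed.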
APS Data Layout Options
[Diagram: four “Compute” nodes, each running SQL over balanced storage with its own DMS instance]
Star schema example:
• Time Dim: Date Dim ID, Calendar Year, Calendar Qtr, Calendar Mo, Calendar Day
• Store Dim: Store Dim ID, Store Name, Store Mgr, Store Size
• Product Dim: Prod Dim ID, Prod Category, Prod Sub Cat, Prod Desc
• Customer Dim: Cust Dim ID, Cust Name, Cust Addr, Cust Phone, Cust Email
• Sales Fact: Date Dim ID, Store Dim ID, Prod Dim ID, Cust Dim ID, Qty Sold, Dollars Sold
Replicated: table copied to each compute node
Distributed: table spread across compute nodes based on a “hash”
[Diagram: in the star schema, each compute node holds replicated copies of the Time, Product, Store, and Customer dimensions alongside its slice of the SalesFact table]
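As a sketch of what the two layouts look like in APS T-SQL (table and column names here are illustrative, not from the deck), a small dimension is typically replicated while a large fact table is hash-distributed:

```sql
-- Small dimension: one full copy on every compute node
CREATE TABLE DimProduct
(
    ProdDimID INT NOT NULL,
    ProdCategory VARCHAR(50)
)
WITH (DISTRIBUTION = REPLICATE);

-- Large fact table: rows spread across nodes by hashing the distribution key
CREATE TABLE FactSalesExample
(
    ProdDimID INT NOT NULL,
    QtySold INT NOT NULL
)
WITH (DISTRIBUTION = HASH(ProdDimID));
```

Replicating the dimensions lets each node join its fact slice locally, without shuffling dimension rows between nodes at query time.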
[Diagram: the SalesFact table is hash-distributed into eight physical tables (FactSales_A through FactSales_H) on every compute node]
DATA DISTRIBUTION
CREATE TABLE FactSales
(
    ProductKey INT NOT NULL,
    OrderDateKey INT NOT NULL,
    DueDateKey INT NOT NULL,
    ShipDateKey INT NOT NULL,
    ResellerKey INT NOT NULL,
    EmployeeKey INT NOT NULL,
    PromotionKey INT NOT NULL,
    CurrencyKey INT NOT NULL,
    SalesTerritoryKey INT NOT NULL,
    SalesOrderNumber VARCHAR(20) NOT NULL
)
WITH
(
    DISTRIBUTION = HASH(ProductKey),
    CLUSTERED INDEX (OrderDateKey),
    PARTITION (OrderDateKey RANGE RIGHT FOR VALUES (20010601, 20010901))
);
Control Node → Compute Node 1, Compute Node 2, … Compute Node X
The control node sends the Create Table SQL to each compute node:
Create Table FactSales_A
Create Table FactSales_B
Create Table FactSales_C
…
Create Table FactSales_H
Each compute node ends up with eight physical tables (FactSales_A through FactSales_H), and the table metadata is created on the control node.
APS – Balanced across servers and within
• Largest table: 600,000,000,000 rows
• Randomly distributed across 40 compute nodes (5 racks): 15,000,000,000 rows per node
• In each server, randomly distributed to 8 tables (so 320 total tables): 1,875,000,000 rows per table
• Each partition (2 years of data partitioned by week, benefiting queries by date): 18,028,846 rows
As an end user or DBA you think about 1 table: LineItem.
“SELECT * FROM LineItem” is split into 320 queries running in parallel against 320 tables of 1.875 billion rows each.
“SELECT * FROM LineItem WHERE OrderDate = '1/1/2014'” is 320 queries against 320 partitions of about 18 million rows each.
You don’t care or need to know that there are actually 320 tables representing your 1 logical table.
CCI (clustered columnstore index) can add further performance via segment elimination.
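The row counts above follow from straight division; a quick check (assuming "2 years partitioned by week" means 104 weekly partitions):

```python
# Reproduce the row-count arithmetic from the slide above.
total_rows = 600_000_000_000          # largest table
nodes = 40                            # compute nodes (5 racks)
tables_per_node = 8                   # physical tables per node -> 320 tables total
weeks = 104                           # 2 years of weekly partitions (assumed)

rows_per_node = total_rows // nodes                 # 15,000,000,000
rows_per_table = rows_per_node // tables_per_node   # 1,875,000,000
rows_per_partition = rows_per_table // weeks        # 18,028,846

print(nodes * tables_per_node, rows_per_node, rows_per_table, rows_per_partition)
```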
Appliance capacity options:
• ¼ rack: 15 TB (uncompressed)
• ½ rack: 30 TB (uncompressed)
• Full rack: 60 TB (uncompressed)
• 1¼ racks: 75.5 TB (uncompressed)
• 1½ racks: 90.6 TB (uncompressed)
• 2 racks: 120.8 TB (uncompressed)
• 3 racks: 181.2 TB (uncompressed)
• 2–56 compute nodes (32–896 cores)
• 1–7 racks
• 1, 2, or 3 TB drives
• 15 TB – 1.2 PB uncompressed
• 75 TB – 6 PB user data (at 5:1 compression)
• Up to 7 spare nodes available across the entire appliance
• Dual InfiniBand: 56 Gbps
Microsoft Analytics Platform System
Your turnkey modern data warehouse appliance
Advanced Analytics Defined
What is Hadoop?
• Distributed, scalable system on commodity hardware
• Composed of a few parts:
  - HDFS: distributed file system
  - MapReduce: programming model
  - Other tools: Hive, Pig, Sqoop, HCatalog, HBase, Flume, Mahout, YARN, Tez, Spark, Stinger, Oozie, ZooKeeper, Storm
• Main players are Hortonworks, Cloudera, and MapR
• WARNING: Hadoop, while ideal for processing huge volumes of data, is inadequate for analyzing that data in real time (companies do batch analytics instead)
Core Services
• Operational services: Oozie, Ambari
• Data services: Hive & HCatalog, Pig, HBase, Falcon (running on MapReduce and YARN over HDFS)
• Load & extract: Sqoop, Flume, NFS, WebHDFS
Hadoop clusters provide scale-out storage and distributed data processing on commodity hardware, with compute & storage co-located on each node.
Complex query and analysis with big data today: steep learning curve, slow and inefficient
• Move HDFS data into the warehouse (ETL) before analysis
• Learn new skills beyond T-SQL
• Build, integrate, manage, maintain, and support the Hadoop ecosystem yourself
[Diagram: “new” data sources feeding the Hadoop ecosystem]
APS delivers enterprise-ready Hadoop with HDInsight
Manageable, secured, and highly available Hadoop integrated into the appliance:
• High performance, tuned within the appliance
• End-user authentication with Active Directory
• Accessible insights for everyone with Microsoft BI tools
• Managed and monitored using System Center
• 100% Apache Hadoop
• PolyBase bridges SQL Server Parallel Data Warehouse and Microsoft HDInsight, so you can leverage your existing T-SQL skills
Additional features over a separate Hadoop cluster, plus still one support contact!
APS appliance overview
The appliance hosts a Parallel Data Warehouse region and an HDInsight region on shared fabric and hardware. A region is a logical container within an appliance. Each workload contains the following boundaries:
• Security
• Metering
• Servicing
Query Hadoop data with T-SQL using PolyBase
Bringing the worlds of big data and the data warehouse together for users and IT:
• Provides a single T-SQL query model (“semantic layer”) for APS and Hadoop with rich features of T-SQL, including joins without ETL
• Uses the power of MPP to enhance query execution performance
• Supports Windows Azure HDInsight to enable new hybrid cloud scenarios
• Provides the ability to query non-Microsoft Hadoop distributions, such as Hortonworks and Cloudera
• Uses your existing SQL skill set, with no IT intervention
PolyBase connects SQL Server Parallel Data Warehouse and Microsoft HDInsight (HDP 2.0) with:
• Cloudera CDH 5.1 (Linux)
• Hortonworks HDP 2.2 (Windows, Linux)
• Windows Azure HDInsight (HDP 2.2, WASB)
• Others in the future (SQL Server, DB2, Oracle)?
A true federated query engine.
Use cases where PolyBase simplifies using Hadoop data:
• Bringing islands of Hadoop data together
• High-performance queries against Hadoop data (predicate pushdown)
• Archiving data warehouse data to Hadoop (move; Hadoop as cold storage)
• Exporting relational data to Hadoop (copy; Hadoop as backup/DR, analysis, cloud use)
• Importing Hadoop data into the data warehouse (copy; Hadoop as staging area, sandbox, Data Lake)
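As a sketch of how this looks in practice (APS-era PolyBase T-SQL; the data source, file format, table, and column names below are hypothetical, and exact syntax varies by version), Hadoop data is exposed as an external table and then joined with relational data in ordinary T-SQL:

```sql
-- Register the Hadoop cluster and the file layout (illustrative names)
CREATE EXTERNAL DATA SOURCE MyHadoop WITH (
    TYPE = HADOOP,
    LOCATION = 'hdfs://hadoop-head-node:8020'
);
CREATE EXTERNAL FILE FORMAT CsvFormat WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (FIELD_TERMINATOR = ',')
);

-- Expose a directory of delimited files as a queryable table
CREATE EXTERNAL TABLE ClickStream (
    Url       VARCHAR(500),
    EventDate DATE,
    UserId    INT
) WITH (
    LOCATION = '/logs/clickstream/',
    DATA_SOURCE = MyHadoop,
    FILE_FORMAT = CsvFormat
);

-- Join Hadoop data with warehouse data in a single T-SQL query, no ETL
SELECT c.Url, COUNT(*) AS Visits
FROM ClickStream AS c
JOIN DimUser AS u ON u.UserId = c.UserId
GROUP BY c.Url;
```

Export in the other direction (the archiving and backup use cases above) uses CREATE EXTERNAL TABLE AS SELECT to write query results back into Hadoop.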
Big data insights for anyone
Native Microsoft BI integration to create new insights with familiar tools:
• Power users: T-SQL for DBAs and power users to join relational and Hadoop data
• Data scientists: Hadoop tools like MapReduce, Hive, and Pig
• Everyone else using Microsoft BI tools: tools like Power BI minimize IT intervention for discovering data, leveraging the high adoption of Excel, Power View, Power Pivot, and SSAS
Microsoft Analytics Platform System
Your turnkey modern data warehouse appliance
Scale-out
• Massively Parallel Processing (MPP) parallelizes queries (speed-driven, not just capacity-driven)
• Multiple nodes with dedicated CPU, memory, and storage (“shared nothing”)
• Incrementally add hardware for near-linear scale to multi-PB (no need to delete older data or stage it)
• Handles query complexity and concurrency at scale
• No “forklift” of the prior warehouse to increase capacity
• Start small with a warehouse of a few terabytes
• Mixed workload support: query while you load (250 GB/hour per node), with no need for a maintenance window
Scale-out technologies in the Analytics Platform System
[Diagram: capacity grows from 0 TB to 6 PB by incrementally adding PDW or HDInsight scale units]
• Store data in columnar format for massive compression
• Load data into or out of memory for next-generation performance
• Updateable and clustered for real-time trickle loading
• No secondary indexes required
Updatable clustered columnstore vs. a table with customary indexing: up to 100x faster queries and up to 15x more compression.
[Diagram: columnstore index representation; parallel execution of a query into results]
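A minimal sketch of the DDL involved (table and column names are hypothetical): in APS, the updatable clustered columnstore is declared right in the table definition alongside the distribution choice, so no secondary indexes are needed.

```sql
-- Distributed fact table stored as an updatable clustered columnstore index
CREATE TABLE FactSalesColumnstore
(
    OrderDateKey INT   NOT NULL,
    ProductKey   INT   NOT NULL,
    QtySold      INT   NOT NULL,
    DollarsSold  MONEY NOT NULL
)
WITH
(
    DISTRIBUTION = HASH(ProductKey),
    CLUSTERED COLUMNSTORE INDEX
);
```

Because the columnstore is the table's primary storage, compression and segment elimination apply to every query against it without further tuning.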
Investment firm before/after results (HP SMP vs. APS)
• 21x improvement loading data (7:30 minutes vs. 21 seconds)
• 62x improvement staging to landing (30 minutes vs. 29 seconds)
• 17x, 166x, and 169x query performance improvements (e.g., 1:05 hours vs. 23 seconds)
• 46x improvement creating a datamart (70 minutes vs. 1:31 minutes)
• 1.1 TB/hr loading rate and 8.8x compression on 2 billion rows (472 GB to 53 GB)
• Microsoft BI tools work unchanged
Concurrency that fuels rapid adoption
Great performance with mixed workloads on the Analytics Platform System:
• ETL/ELT into APS with SSIS, DQS, and MDS, or with DWLoader, from ERP, CRM, and LOB apps
• Hadoop / big data through the PDW and HDInsight regions via PolyBase
• BI tools, reporting, and cubes served from SQL Server SMP (spoke)
• Ad hoc queries: intra-day and near real-time, with fast ad hoc performance via columnstore, PolyBase, CRTAS, and “Link Table”
• Real-time: ROLAP/MOLAP, DirectQuery, SNAC
Example overall data flow and architecture
• Event & data producers: web logs, IoT, mobile devices, social data
• Ingest: Event Hubs
• Transform: Stream Analytics, HDInsight, Azure Data Factory
• Store: Azure SQL DB, Azure Blob Storage; DW / long-term storage: Analytics Platform System
• Predictive analytics: Azure Machine Learning (fraud detection, etc.)
• Present & decide: Power BI, web dashboards, mobile devices
Microsoft Analytics Platform System
Your turnkey modern data warehouse appliance
APS provides the industry’s lowest DW appliance price/TB
Reshaped hardware specs through software innovation; price per terabyte for leading vendors (Sept 2014):
• Significantly lower price per TB than the closest competitor
• Lower storage costs with Windows Server 2012 Storage Spaces
• Small cost gap between multiple clustered HP DL980s with SAN vs. an APS ¼ rack
[Chart: TCO per TB (uncompressed), in thousands of dollars ($0–$140K), for Oracle, Pivotal, IBM, Teradata, and Microsoft]
Virtualized architecture overview
[Diagram: four hosts connected by InfiniBand and Ethernet, with direct-attached SAS to economical disk storage. The base unit runs the CTL, MAD, AD, and VMM virtual machines on Host 1, with the Compute 1 and Compute 2 VMs on the other hosts]
The control VM runs:
• The APS engine
• DMS Manager
• SQL Server 2012 Enterprise Edition (APS build) (AU3: SQL Server 2014)
Software details
• All hosts run Windows Server 2012 Standard (AU3:
2012 R2) and Windows Azure Virtual Machines
• Fabric or workload in Hyper-V Virtual Machines
• Fabric virtual machine, management server (MAD01),
and control server (CTL) share one server
• APS agent that runs on all hosts and all virtual
machines
• DWConfig and Admin Console
• Windows Storage Spaces and Azure Storage blobs
• Does not require expertise in Hyper-V or Windows
APS High-Availability
[Diagram: compute hosts, a control host, and a failover host on redundant InfiniBand and Ethernet networks; on a host or network failure, the affected VMs (FAB, AD, VMM, MAD, CTL, Compute) fail over]
• No single point of failure
• No need for SQL Server clustering
Less DBA Maintenance/Monitoring
• No index creation
• No deleting/archiving data to save space
• Management simplicity (System Center, Admin console, DMVs)
• No blocking
• No logs
• No query hints
• No wait states
• No IO tuning
• No query optimization/tuning
• No index reorgs/rebuilds
• No partitioning
• No managing filegroups
• No shrinking/expanding databases
• No managing physical servers
• No patching servers and software
RESULT: DBAs spend more of their time as architects and not babysitters!
The no-compromise modern data warehouse solution
Microsoft’s turn-key modern data warehouse appliance
Analytics Platform System
Microsoft
• Improved query performance
• Faster data loading
• Improved concurrency
• Less DBA maintenance
• Limited training needed
• Use familiar BI tools
• Ease of appliance deployment
• Mixed workload support
• Improved data compression
• Scalability
• High availability
• PolyBase
• Integration with cloud-born data
• HDInsight/Hadoop integration
• Data warehouse consolidation
• Easy support model
Summary of Benefits
Bold items are benefits of APS over simply upgrading to SQL Server 2014, with no worry about future hardware roadblocks.
Questions?
James Serra
jserra@microsoft.com
Blog about PDW topics:
http://www.jamesserra.com/archive/category/pdw/
Roadmap highlights
Enterprise-ready big data, cloud-enabled:
• Improved PolyBase support: Cloudera 5.1 support, partial aggregate pushdowns
• Expanding big data capacity: grow the HDInsight region on an appliance with an existing region
Next-gen performance, engineered for optimal value:
• 1.5x data return rate for SELECT * queries
• Streaming large data sets to external apps (e.g., SSAS, SAS, R)
T-SQL compatibility (symmetry between DW on-premises and Azure):
• Scalar UDFs (CREATE FUNCTION)
• SQL Server SMP to APS (SQL Server MPP) migration utility
• Bulk load / BCP through SQL Server command-line tools
Appliance hardware:
• OEM hardware refresh (HP Gen9): HP ProLiant DL360 Gen9 server with 2x Intel Haswell processors and 256 GB (16x16 GB) 2133 MHz memory
• HP 5900 series switches (HA improvements)
Big Data: It’s all about the Use CasesBig Data: It’s all about the Use Cases
Big Data: It’s all about the Use Cases
 

Modern Data Warehousing with the Microsoft Analytics Platform System

  • 1. Microsoft Analytics Platform System (APS) Modern Data Warehousing James Serra Big Data Evangelist Microsoft
  • 2. Agenda • Traditional data warehouse & modern data warehouse • APS architecture • Hadoop & PolyBase • Performance and scale • Appliance benefits • Summarize/questions
  • 3. Data sources Will your current solution handle future needs?
  • 6. Are you using or going to use “Big Data” and/or “Hadoop”? No or limited access to detailed data; can only surface reports and cannot ask ad-hoc questions. Slow data loading performance cannot keep up with the need for data from transactional systems for intraday reporting. MOLAP cube processing and data refresh take too long. Slow query performance with need for constant tuning, especially with SAN storage. High cost of SAN storage chargeback.
  • 7. Roadblocks to evolving to a modern data warehouse (each solution and the issue with that solution): keep the legacy investment; buy a new tier-one hardware appliance; acquire a big data solution (Hadoop); acquire a business intelligence solution. The issues: limited scalability & ability to handle new data types; significant training & still siloed; high acquisition/migration costs & no Hadoop; complex with low adoption.
  • 8. Introducing the Microsoft Analytics Platform System Your turnkey modern data warehouse appliance • Relational and non-relational data in a single appliance • Or, integrate relational data with non-relational data in an external Hadoop cluster on-premises or data stored in the cloud (hot, warm, cold) • Enterprise-ready Hadoop • Integrated querying across Hadoop and APS using T-SQL (PolyBase) • Direct integration with Microsoft BI tools such as Power BI • Near real-time performance with In-Memory • Scale out to accommodate your growing data or to increase performance (2 nodes to 56 nodes) • Remove SMP DW bottlenecks with MPP SQL Server • No rip and replace when more performance is needed • No performance tuning required • Concurrency that fuels rapid adoption • Industry’s lowest DW price/TB • Value through a single-appliance solution • Value with flexible hardware options using commodity hardware • Free up space on the SAN (cost averages $10K per TB)
  • 10. Hardware and software engineered together The ease of an appliance • Co-engineered with HP, Dell, and Quanta using best practices • Leading performance with commodity hardware • Pre-configured, built, and tuned software and hardware • Integrated support plan with a single Microsoft contact (PDW, HDInsight, PolyBase)
  • 11. APS History • DatAllegro started in 2003 • Microsoft acquired DatAllegro in September 2008 • PDW released in December 2010 (version 1) • Version 2 made available in March 2013 (PolyBase introduced) • AU1 released in April 2014; renamed from Parallel Data Warehouse (PDW) to Analytics Platform System (APS). It still includes the PDW region as well as a new HDInsight/Hadoop region • AU2 released in July 2014 • AU3 released in October 2014. There will be AU updates every 3-4 months. NOTE: This is a data warehouse solution, not an OLTP (online transaction processing) solution. Case studies: go to https://customers.microsoft.com and search for "parallel data warehouse" (old name), then for "analytics platform system" (new name)
  • 12. Parallelism MPP (Massively Parallel Processing): uses many separate CPUs running in parallel to execute a single program; shared nothing: each CPU has its own memory and disk (scale-out); segments communicate using a high-speed network between nodes. SMP (Symmetric Multiprocessing): multiple CPUs are used to complete individual processes simultaneously; all CPUs share the same memory, disks, and network controllers (scale-up); all SQL Server implementations up until now have been SMP; mostly, the solution is housed on a shared SAN.
  • 13. APS Logical Architecture (overview) Compute node – the “worker bee” of APS: runs SQL Server 2014 APS, contains a “slice” of each database, and its CPU is saturated by storage. Control node – the “brains” of APS: also runs SQL Server 2014 APS, holds a “shell” copy of each database (metadata, statistics, etc.), and is the “public face” of the appliance. Data Movement Service (DMS) – part of the “secret sauce” of APS: moves data around as needed and enables parallel operations among the compute nodes (queries, loads, etc.).
  • 14. APS Logical Architecture (overview) Query flow: 1) User connects to the appliance (control node) and submits a query 2) The control node query processor determines the best *parallel* query plan 3) DMS distributes sub-queries to each compute node 4) Each compute node executes the query on its subset of data 5) Each compute node returns a subset of the response to the control node 6) If necessary, the control node does any final aggregation/computation 7) The control node returns results to the user. Queries run in parallel on subsets of the data, using separate pipes, effectively making the pipe larger.
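The scatter-gather flow on this slide can be sketched in a few lines. This is a hypothetical toy model, not APS code: the node slices, product names, and aggregation are invented for illustration; the real appliance does this with DMS and SQL Server instances.

```python
# Toy sketch of a scatter-gather query: the control node fans a sub-query out
# to every compute node, each node runs it against its own slice of the data,
# and the control node performs the final aggregation.
from concurrent.futures import ThreadPoolExecutor

# Assumed toy data: each compute node holds a slice of a sales table.
node_slices = [
    [("widget", 10), ("gadget", 5)],   # compute node 1
    [("widget", 7)],                   # compute node 2
    [("gadget", 3), ("widget", 1)],    # compute node 3
]

def run_subquery(slice_rows):
    """Each node computes a partial SUM(qty) grouped by product."""
    partial = {}
    for product, qty in slice_rows:
        partial[product] = partial.get(product, 0) + qty
    return partial

def control_node_query():
    """Scatter the sub-query to all nodes, then aggregate the partial results."""
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(run_subquery, node_slices))
    final = {}
    for partial in partials:
        for product, qty in partial.items():
            final[product] = final.get(product, 0) + qty
    return final

print(control_node_query())  # {'widget': 18, 'gadget': 8}
```

Each node only ever touches its own slice, which is why adding nodes widens the "pipe" rather than contending for shared storage.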
  • 15. APS Data Layout Options A star schema with dimension tables (Time, Store, Product, Customer) and a Sales fact table illustrates the two layouts: a Replicated table is copied in full to each compute node (typical for small dimension tables), while a Distributed table is spread across the compute nodes based on a “hash” of a column (typical for large fact tables such as SalesFact).
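The two placement rules can be sketched as follows. This is an illustrative model only: `NUM_NODES`, the use of Python's built-in `hash`, and the node-id return values are assumptions, not APS's actual hash function or node addressing.

```python
# Toy sketch of the two data layout options on a 4-node appliance.
NUM_NODES = 4

def place_distributed_row(product_key):
    """Distributed table: each row lands on exactly one compute node,
    chosen by hashing the distribution column (here, ProductKey)."""
    return hash(product_key) % NUM_NODES

def place_replicated_row():
    """Replicated table: every compute node receives a full copy of the row."""
    return list(range(NUM_NODES))

# A fact row with ProductKey 123 is stored once, on a single node:
print(place_distributed_row(123))   # one node id in 0..3
# A dimension row is stored on every node:
print(place_replicated_row())       # [0, 1, 2, 3]
```

Replicating the small dimensions to every node lets each node join its fact slice locally, without shuffling dimension rows over the network at query time.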
  • 16. Data Distribution The one logical FactSales table is physically stored as eight smaller tables (FactSales_A through FactSales_H) on every compute node. When the table is created, the control node records the table metadata locally and sends the CREATE TABLE statement to each compute node, which creates its eight physical tables:

CREATE TABLE FactSales (
    ProductKey INT NOT NULL,
    OrderDateKey INT NOT NULL,
    DueDateKey INT NOT NULL,
    ShipDateKey INT NOT NULL,
    ResellerKey INT NOT NULL,
    EmployeeKey INT NOT NULL,
    PromotionKey INT NOT NULL,
    CurrencyKey INT NOT NULL,
    SalesTerritoryKey INT NOT NULL,
    SalesOrderNumber VARCHAR(20) NOT NULL
)
WITH (
    DISTRIBUTION = HASH(ProductKey),
    CLUSTERED INDEX (OrderDateKey),
    PARTITION (OrderDateKey RANGE RIGHT FOR VALUES (20010601, 20010901))
);
  • 17. APS – Balanced across servers and within. Largest table: 600,000,000,000 rows, randomly distributed across 40 compute nodes (5 racks) = 15,000,000,000 rows per server. Within each server, rows are randomly distributed to 8 tables (so 320 tables in total) = 1,875,000,000 rows per table. Each table holds 2 years of data partitioned by week (benefiting queries that filter by date) = 18,028,846 rows per partition. As an end user or DBA you think about one table: LineItem. "SELECT * FROM LineItem" is split into 320 queries running in parallel against 320 tables of 1.875 billion rows each; "SELECT * FROM LineItem WHERE OrderDate = '1/1/2014'" is 320 queries against 320 partitions of roughly 18 million rows. You don't care or need to know that 320 physical tables back your one logical table. A clustered columnstore index (CCI) can add further performance via segment elimination.
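The fan-out described above is handled by the engine, not the user. A sketch of the two queries (table and column names are the slide's hypothetical example):

```sql
-- Runs as 320 parallel scans, one per distribution
-- (about 1.875 billion rows each).
SELECT * FROM LineItem;

-- Weekly partitioning means each of the 320 parallel scans touches
-- only the ~18-million-row partition covering this date; a clustered
-- columnstore index can additionally skip non-qualifying segments.
SELECT * FROM LineItem WHERE OrderDate = '2014-01-01';
```

Both statements are written against the single logical table; distribution and partition pruning happen transparently.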
  • 18. Appliance capacity (uncompressed): ¼ rack 15 TB • ½ rack 30 TB • full rack 60 TB • 1¼ racks 75.5 TB • 1½ racks 90.6 TB • 2 racks 120.8 TB • 3 racks 181.2 TB • 2–56 compute nodes (32–896 cores) • 1–7 racks • 1, 2, or 3 TB drives • 15 TB – 1.2 PB uncompressed • 75 TB – 6 PB user data (at 5:1 compression) • Up to 7 spare nodes available across the entire appliance • Dual InfiniBand: 56 Gbps
  • 19. Microsoft Analytics Platform System Your turnkey modern data warehouse appliance
  • 22. What is Hadoop?  Distributed, scalable system on commodity hardware  Composed of a few parts:  HDFS – distributed file system  MapReduce – programming model  Other tools: Hive, Pig, Sqoop, HCatalog, HBase, Flume, Mahout, YARN, Tez, Spark, Stinger, Oozie, ZooKeeper, Storm  Main players are Hortonworks, Cloudera, and MapR  WARNING: Hadoop, while ideal for processing huge volumes of data, is inadequate for analyzing that data in real time (companies do batch analytics instead). [Slide diagram] Core services — operational services (Ambari, Oozie), data services (Sqoop, Flume, NFS, WebHDFS for load and extract), and HDFS, YARN, MapReduce, Hive & HCatalog, Pig, HBase, Falcon — over a Hadoop cluster of many compute-and-storage nodes. Hadoop clusters provide scale-out storage and distributed data processing on commodity hardware.
  • 23. Complex query and analysis with big data today: steep learning curve, slow and inefficient. [Slide diagram] "New" data sources land in a Hadoop ecosystem that you must build, integrate, manage, maintain, and support while learning new skills; data is then moved out of HDFS via ETL into the warehouse before it can be analyzed with T-SQL.
  • 24. APS delivers enterprise-ready Hadoop with HDInsight: manageable, secured, and highly available Hadoop integrated into the appliance • high performance tuned within the appliance • end-user authentication with Active Directory • accessible insights for everyone with Microsoft BI tools • managed and monitored using System Center • 100% Apache Hadoop. [Slide diagram] SQL Server Parallel Data Warehouse and Microsoft HDInsight connected by PolyBase. Leverage your existing T-SQL skills, gain additional features over a separate Hadoop cluster, and still have a single support contact.
  • 25. APS appliance overview. [Slide diagram] A Parallel Data Warehouse region and an HDInsight region sit side by side over shared fabric and hardware in the appliance. A region is a logical container within the appliance; each region provides its own boundaries for: • Security • Metering • Servicing
  • 26. Query Hadoop data with T-SQL using PolyBase: bringing the worlds of big data and the data warehouse together for users and IT. Provides a single T-SQL query model ("semantic layer") for APS and Hadoop with the rich features of T-SQL, including joins without ETL. Uses the power of MPP to enhance query execution performance. Supports Windows Azure HDInsight to enable new hybrid cloud scenarios. Provides the ability to query non-Microsoft Hadoop distributions, such as Hortonworks and Cloudera. Use your existing SQL skill set; no IT intervention required. [Slide diagram] SQL Server Parallel Data Warehouse connects through PolyBase to Microsoft HDInsight (HDP 2.0), Cloudera CDH 5.1 (Linux), Hortonworks HDP 2.2 (Windows, Linux), Windows Azure HDInsight (HDP 2.2, WASB), and potentially others (SQL Server, DB2, Oracle?) — a true federated query engine.
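A hedged sketch of the PolyBase pattern in APS (object names, paths, and the cluster address are hypothetical, and the exact DDL varies by appliance update): declare where the Hadoop data lives, then query it with ordinary T-SQL, joins included.

```sql
-- Point APS at an external Hadoop cluster (hypothetical address).
CREATE EXTERNAL DATA SOURCE MyHadoop
WITH (TYPE = HADOOP, LOCATION = 'hdfs://hdp-head-node:8020');

-- Describe how the files are laid out.
CREATE EXTERNAL FILE FORMAT PipeDelimited
WITH (FORMAT_TYPE = DELIMITEDTEXT,
      FORMAT_OPTIONS (FIELD_TERMINATOR = '|'));

-- Schema-on-read over files already sitting in HDFS: no ETL, no move.
CREATE EXTERNAL TABLE ClickStream
(
    CustDimID INT,
    Url       VARCHAR(200),
    ClickTime DATETIME
)
WITH (LOCATION = '/logs/clicks/',
      DATA_SOURCE = MyHadoop,
      FILE_FORMAT = PipeDelimited);

-- Plain T-SQL join of Hadoop data to a PDW dimension table; PolyBase
-- can push the predicate down to the cluster as MapReduce work.
SELECT   c.CustName, COUNT(*) AS Clicks
FROM     ClickStream AS cs
JOIN     CustomerDim AS c ON c.CustDimID = cs.CustDimID
WHERE    cs.ClickTime >= '2014-01-01'
GROUP BY c.CustName;
```

To the user the external table behaves like any other table in the query; the "semantic layer" is exactly this single T-SQL surface over both stores.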
  • 27. Use cases where PolyBase simplifies using Hadoop data: • Bringing islands of Hadoop data together • High-performance queries against Hadoop data (predicate pushdown) • Archiving data warehouse data to Hadoop (move) (Hadoop as cold storage) • Exporting relational data to Hadoop (copy) (Hadoop as backup/DR, analysis, cloud use) • Importing Hadoop data into the data warehouse (copy) (Hadoop as staging area, sandbox, data lake)
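The archive and export cases map to CREATE EXTERNAL TABLE AS SELECT (CETAS), which writes a query result out to Hadoop; the import case is the reverse, a CTAS from an external table. A sketch, assuming an external data source and file format have already been defined (all names and paths below are hypothetical):

```sql
-- Archive: age cold 2010 facts out of the warehouse into HDFS
-- ("Hadoop as cold storage").
CREATE EXTERNAL TABLE FactSales_2010
WITH (LOCATION = '/archive/sales/2010/',
      DATA_SOURCE = MyHadoop,
      FILE_FORMAT = PipeDelimited)
AS
SELECT *
FROM   FactSales
WHERE  OrderDateKey BETWEEN 20100101 AND 20101231;

-- Import: copy Hadoop data into a regular distributed PDW table
-- ("Hadoop as staging area").
CREATE TABLE StagedClicks
WITH (DISTRIBUTION = HASH(CustDimID))
AS SELECT * FROM ClickStream;
```

Either direction is a single parallel statement, so no external ETL tool is needed for these movements.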
  • 28. Big data insights for anyone: native Microsoft BI integration to create new insights with familiar tools. Tools like Power BI minimize IT intervention for discovering data. T-SQL lets DBAs and power users join relational and Hadoop data. Hadoop tools like MapReduce, Hive, and Pig serve data scientists. Leverages the high adoption of Excel, Power View, Power Pivot, and SSAS. [Slide diagram] Power users, data scientists, and everyone else using Microsoft BI tools.
  • 29. Microsoft Analytics Platform System Your turnkey modern data warehouse appliance
  • 30. Scale-out technologies in the Analytics Platform System  Massively Parallel Processing (MPP) parallelizes queries (speed-driven, not just capacity-driven)  Multiple "shared-nothing" nodes with dedicated CPU, memory, and storage  Incrementally add hardware for near-linear scale to multi-petabytes (no need to delete older data or stage it)  Handles query complexity and concurrency at scale  No "forklift" migration of the prior warehouse to increase capacity  Start small with a warehouse of a few terabytes  Mixed workload support: query while you load (250 GB/hour per node), with no need for a maintenance window. [Slide diagram] Scale from 0 TB to 6 PB by adding PDW or HDInsight scale units.
  • 31. • Store data in columnar format for massive compression • Load data into or out of memory for next-generation performance • Updateable and clustered for real-time trickle loading • No secondary indexes required • Up to 100x faster queries (updatable clustered columnstore vs. a table with customary indexing) • Up to 15x more compression. [Slide diagram] Columnstore index representation; parallel query execution producing results.
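In APS the columnstore described above is the table's clustered (primary) storage, declared at creation time. A minimal sketch with hypothetical names:

```sql
-- The clustered columnstore index IS the table's storage: columnar
-- compression (up to ~15x), segment elimination, and updatability
-- for trickle loads, with no secondary indexes needed.
CREATE TABLE FactSalesCCI
(
    OrderDateKey INT   NOT NULL,
    ProductKey   INT   NOT NULL,
    StoreKey     INT   NOT NULL,
    SalesAmount  MONEY NOT NULL
)
WITH (DISTRIBUTION = HASH(ProductKey),
      CLUSTERED COLUMNSTORE INDEX);
```

Because the index is both clustered and updatable, INSERT/UPDATE/DELETE work directly against it; there is no separate rowstore copy to maintain.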
  • 32. Investment firm before/after results — HP SMP vs. APS: • 21x improvement loading data (7:30 minutes vs. 21 seconds) • 62x improvement staging to landing (30 minutes vs. 29 seconds) • 17x, 166x, and 169x query performance improvements (e.g., 1:05 hours vs. 23 seconds) • 46x improvement creating a data mart (70 minutes vs. 1:31 minutes) • 1.1 TB/hr load rate; 8.8x compression on 2 billion rows (472 GB to 53 GB) • Microsoft BI tools work unchanged
  • 33. Concurrency that fuels rapid adoption; great performance with mixed workloads. [Slide diagram] Source systems (ERP, CRM, LOB apps) feed the Analytics Platform System — PDW plus HDInsight joined by PolyBase — via ETL/ELT with SSIS, DQS, and MDS or with DWLoader, alongside Hadoop/big data sources. The APS hub serves ad hoc, intra-day, near real-time, and fast ad hoc queries via columnstore and PolyBase, and feeds SQL Server SMP spokes via CRTAS "link tables" for BI tools, reporting, and cubes (real-time ROLAP/MOLAP, DirectQuery, SNAC).
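The "CRTAS link table" path on the slide refers to CREATE REMOTE TABLE AS SELECT, which pushes a query result from the PDW hub out to an SMP SQL Server spoke. A hedged sketch — the spoke address, credentials, and all object names are hypothetical placeholders:

```sql
-- Materialize an aggregate on a spoke SQL Server for reporting/cubes;
-- the connection string identifies the remote SMP instance.
CREATE REMOTE TABLE SpokeDB.dbo.DailySales
AT ('Data Source = 10.0.0.25, 1433; User ID = loader; Password = ****;')
AS
SELECT   OrderDateKey, SUM(DollarsSold) AS DollarsSold
FROM     SalesFact
GROUP BY OrderDateKey;
```

The heavy aggregation runs once on the MPP hub; the spoke then serves many concurrent report and cube users from the small result table.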
  • 34. Example overall data flow and architecture. [Slide diagram] Event and data producers (web logs; IoT and mobile devices; social data) are ingested through Event Hubs, transformed by Stream Analytics and HDInsight with Azure Data Factory orchestrating, and landed in the Analytics Platform System, Azure SQL DB, and Azure Blob Storage for DW/long-term storage. Azure Machine Learning adds predictive analytics (fraud detection, etc.), and Power BI presents results on web dashboards and mobile devices for decisions.
  • 35. Microsoft Analytics Platform System Your turnkey modern data warehouse appliance
  • 36. APS provides the industry's lowest DW appliance price/TB — reshaped hardware specs through software innovation. Significantly lower price per TB than the closest competitor; lower storage costs with Windows Server 2012 Storage Spaces; small cost gap between multiple clustered HP DL980s with SAN vs. an APS ¼ rack. [Slide chart] TCO per TB (uncompressed, in thousands of dollars) for leading vendors (Sept 2014) — Oracle, Pivotal, IBM, Teradata, Microsoft — on a $0–$140K scale, with Microsoft lowest.
  • 37. Virtualized architecture overview. [Slide diagram] A base unit of four hosts connected by InfiniBand and Ethernet with direct-attached SAS and economical disk storage; one host carries the CTL, MAD, AD, and VMM virtual machines, while the others run Compute 1 and Compute 2 VMs with the APS engine, DMS Manager, and SQL Server 2012 Enterprise Edition (APS build; AU3: SQL Server 2014). Software details: • All hosts run Windows Server 2012 Standard (AU3: 2012 R2) and Windows Azure Virtual Machines • Fabric and workload run in Hyper-V virtual machines • The fabric virtual machine, management server (MAD01), and control server (CTL) share one server • An APS agent runs on all hosts and all virtual machines • DWConfig and Admin Console • Windows Storage Spaces and Azure Storage blobs • Does not require expertise in Hyper-V or Windows
  • 38. APS High Availability. [Slide diagram] Redundant InfiniBand and Ethernet networks (InfiniBand 1/2, Ethernet 1/2); if a compute host or the control host fails, its VMs (Compute 1, Compute 2, or the fabric VMs FAB, AD, VMM, MAD, CTL) fail over to the failover host. • No single point of failure • No need for SQL Server clustering
  • 39. Less DBA Maintenance/Monitoring • No index creation • No deleting/archiving data to save space • Management simplicity (System Center, Admin Console, DMVs) • No blocking • No logs • No query hints • No wait states • No I/O tuning • No query optimization/tuning • No index reorgs/rebuilds • No partitioning • No managing filegroups • No shrinking/expanding databases • No managing physical servers • No patching servers and software RESULT: DBAs spend more of their time as architects and not babysitters!
  • 40. The no-compromise modern data warehouse solution: Microsoft's turnkey modern data warehouse appliance, the Analytics Platform System. Summary of benefits: • Improved query performance • Faster data loading • Improved concurrency • Less DBA maintenance • Limited training needed • Use familiar BI tools • Ease of appliance deployment • Mixed workload support • Improved data compression • Scalability • High availability • PolyBase • Integration with cloud-born data • HDInsight/Hadoop integration • Data warehouse consolidation • Easy support model. Bold = benefits of APS over upgrading to SQL Server 2014; no worries about future hardware roadblocks.
  • 41. Questions? James Serra jserra@microsoft.com Blog about PDW topics: http://www.jamesserra.com/archive/category/pdw/
  • 42. Enterprise-ready big data — cloud enabled: • Improved PolyBase support • Cloudera 5.1 support • Partial aggregate pushdown • Expanded big data capacity: grow the HDInsight region on an appliance with an existing region. Next-gen performance and engineered for optimal value: • 1.5x data return rate for SELECT * queries • Streaming large data sets to external apps (e.g., SSAS, SAS, R, etc.). T-SQL compatibility: • Scalar UDFs (CREATE FUNCTION) • SQL Server SMP to APS (SQL Server MPP) migration utility • Bulk load/BCP through SQL Server command-line tools. Appliance hardware: • OEM hardware refresh (HP Gen9) • HP ProLiant DL360 Gen9 server with 2x Intel Haswell processors and 256 GB (16x16 GB) 2133 MHz memory • HP 5900 series switches (HA improvements). Symmetry between DW on-premises and Azure.

Editor's Notes

  1. Key goal of slide: To convey what every IT person knows: the data warehouse and what it's for. Then we set up the Gartner quote to say that there is a tipping point. End the slide with a question: Why is it at a tipping point?   Slide talk track: What is the "traditional" data warehouse? IT professionals know this well. A data warehouse, or an enterprise data warehouse, is a database designed specifically for data analysis. It is the single source of truth, the central repository for all data in the company. This means disparate data coming from your transactional systems — your ERP, CRM, or line-of-business applications — is extracted, transformed, cleansed, and put into the warehouse. It was built so that the people accessing the warehouse with BI tools work against data that has been provisioned by IT and represents accurate data sanctioned by the company. However, this traditional data warehouse is reaching an inflection point. Gartner, in its analysis of the state of data warehousing, noted that it is reaching the most significant tipping point since its inception. The question is why? What is going on?
  2. Key goal of slide: To convey that the traditional data warehouse is going to break in one of four different ways. These ways should not be a surprise to IT professionals. At the end of the slide, IT should be asking: what can I do to prevent my warehouse from breaking? Slide talk track: There are many reasons why the data warehouse is at its tipping point, where something needs to change. The first trend that will break the traditional data warehouse is data growth. Data volumes are expected to grow 10x over the next five years, and traditional data warehouses cannot keep up with this explosion of data. Second, end users expect to get query results back in near real time. They are no longer willing to wait minutes to hours for their results — something traditional data warehouses cannot deliver. They also want real-time data, not dated data pulled in during a nightly maintenance window. The third trend is new types of captured data that are "non-relational." 85% of data growth is coming from non-relational data such as web logs, sensor data, social sentiment, and devices. You've probably heard the terms "Big Data" and "Hadoop" quite a bit; this is where those technologies come into play. More on that later. The final trend is cloud-born data: data coming from infrastructure that IT is starting to host in the cloud (i.e., CRM, ERP, etc.) or not stored by any corporate-owned system at all. How do you incorporate both on-premises and cloud data into your data warehouse? This is the last trend breaking the traditional data warehouse.
  3. Key goal of slide: To convey that the modern data warehouse is something the traditional data warehouse must evolve into, and to have IT agree that their warehouses need to take advantage of these new technologies (specifically focusing on the middle and bottom layers). Slide talk track: To encompass these four trends, we need to evolve the traditional data warehouse to ensure that it does not break. It needs to become the "modern data warehouse." What is the modern data warehouse? It is the new warehouse that excels at these new trends and can be your warehouse now and into the future. The modern data warehouse has the ability to: Handle all types of data — whether structured, relational data sources or non-relational sources, the modern data warehouse will incorporate Hadoop. Handle real-time data by using complex event processing technologies. Provide a way to enrich your data with extract, transform, load (ETL) capabilities as well as master data management (MDM) and data quality. Provide a way for any BI tool or query mechanism to interface with all these different types of data through a single query model that leverages a query language users already know (for example, SQL). Questions drive BI; analytics drive questions.
  4. Top: solution choice; bottom: problem if you do. Key goal of slide: To convey the limitations of current modern data warehouse options in the market. Slide talk track: Organizations face the challenge of turning to two platforms for managing their data — relational database management systems (RDBMS) for traditional data and Apache Hadoop, the most widely used open-source Big Data platform, for large, non-relational data. Brand-new tier-one appliances are expensive. Major vendors offer tier-one RDBMS appliances, but many come with a high price tag, averaging millions of dollars, and in-company politics may result in long struggles to approve and implement them. Further, most of these appliances focus on point solutions rather than general-purpose use, and they do not include a Hadoop solution, requiring a separate, additional appliance and ecosystem. Hadoop solutions are complex. Vendors can provide a Hadoop solution as their own distribution of Hadoop or as an appliance that comes pre-installed with Hadoop. The problem is that the Hadoop ecosystem requires significant training investment, and a major effort is needed to integrate it. There is a steep learning curve and ongoing operational cost when your IT department needs to re-orient itself around HDFS, MapReduce, Hive, and HBase rather than T-SQL and a standard RDBMS design. The result is often increased cost at a time when IT is expected to streamline. BI tools are unfamiliar. Surveys from Gartner, The BI Survey, and Intelligent Enterprise have found abysmal BI adoption of current solutions (~8%) due to complaints about the complexity of the tools and the cost of the solution. Users want tools they already know and can consume, but no vendor can deliver all the solutions you need at a reasonable cost or in a natively integrated manner. Troubleshooting, support, and maintenance. Keeping up with configuration changes, support, and maintenance, with troubleshooting, is not trivial. Today's world of data is changing rapidly, and organizations need a modern data warehouse to adapt successfully to these changes. However, companies want the smoothest path to this transformation — a path where costs, downtime, and training are minimal, and where performance and accessibility to data insights are vastly improved.
  5. Key goal of slide: To convey the major pillars of the Analytics Platform System with key points. To help organizations make a simple and smooth transition to this new world of data, Microsoft introduces the Microsoft Analytics Platform System (APS) — the only no-compromise modern data warehouse solution that brings both Hadoop and RDBMS together in a single, pre-built appliance with tier-one performance, the lowest TCO in the industry, and accessibility for all users through some of the most widely used BI tools in the industry.   Enterprise-ready Big Data: Microsoft APS combines Microsoft's industry-leading RDBMS platform, the Parallel Data Warehouse appliance (PDW), with Microsoft's Hadoop distribution, HDInsight, for non-relational data, to offer an all-in-one Big Data analytics appliance. Tying together and integrating the worlds of relational and Hadoop data is PolyBase, Microsoft's integrated query tool available only in APS. Your modern data warehouse in one turnkey appliance: APS integrates PDW and HDInsight to operate seamlessly together in a single appliance. Integrated querying across all data types using T-SQL: PolyBase allows Hadoop data to be queried using rich-featured T-SQL, while taking advantage of Hadoop processing, without additional Hadoop-based skills or training. Enterprise-ready Hadoop: HDInsight is Microsoft's Hadoop-based distribution with end-user authentication via Active Directory, managed by IT using System Center. Big data insights for any user: native Microsoft BI integration within PolyBase allows everyone access to insights through familiar tools such as SSAS and Excel. Next-generation performance at scale: APS was built to scale into multi-petabytes, handling both RDBMS data and the data stored in Hadoop, to deliver the performance that today's near real-time and rapid-insight requirements demand. Scale out to accommodate your growing data: APS contains PDW and HDInsight, which both have a linear scale-out architecture. Start small with a few terabytes and dynamically add capacity for seamless, linear scale-out. Remove DW bottlenecks with MPP SQL Server: get the dynamic performance and scale that your modern data warehouse requires while retaining your skills and investment in SQL Server. Real-time performance with in-memory: provides up to 100x improvement in query performance and 15x compression via the updateable in-memory columnstore. Concurrency that supports high adoption: APS has high concurrency, scaling to many simultaneous users and allowing for multiple workloads. Optimal architecture: more than just a converged system, APS has reshaped the very hardware specifications required, through software innovations, to deliver optimal value. Through features delivered in Windows Server 2012, customers get exceptional value. APS provides the industry's lowest DW price/TB: lower cost while maintaining performance using WS2012 Storage Spaces, which replace SAN with economical Windows Storage Spaces; save up to 70% of APS storage with up to 15x compression via the updateable in-memory columnstore. Value through a single-appliance solution: reduce the hardware footprint by having PDW and HDInsight within a single appliance, and remove the need for costly integration efforts. Value through flexible hardware options: avoid hardware lock-in through flexible hardware options from HP, Dell, and Quanta.
  6. The Analytics Platform System is a pre-built appliance that ships to your door. As an appliance, all of the hardware is pre-built — servers, storage arrays, switches, power, racks, and more — and all the software has been installed, configured, and tuned.   Customers receive a fully packaged appliance solution that just works. All they have to do is plug the appliance in and start integrating their specific data into the solution. KEY POINT: Use an interesting story to show how the new modern data warehouse can handle real-time performance with in-memory technologies.   TALK TRACK: We have a flexible choice of hardware vendors — there's no lock-in to hardware that may not fit your exact needs or that is unnecessarily expensive due to lack of choice. Operating a Big Data analytics platform can be as simple as this. Avoid the proprietary hardware lock-ins others try to sell you, and rely on industry-standard components instead. The Microsoft Analytics Platform System is available with the flexibility to choose your preferred hardware from Dell, HP, or Quanta, and each hardware choice has been designed, engineered, and tuned to perform optimally.
  7. 8 tables (8 filegroups, since there is 1 filegroup per table). Each filegroup is made up of 2 physical files. Each scale unit has two compute nodes, giving 16 filegroups and therefore 32 files. Each unit has 32 CPU cores, so there is 1 core per file. You want high cardinality in the distribution key. PDW distributes a single large logical table across 8 tables on each server. The distribution is performed by selecting a column in the table and applying a hash function to it.
  8. Partitioning 2 years of data by day gives 2,568,493 rows per partition. 40 servers * 8 tables = 320 tables. This horizontal partitioning breaks the table up into 8 distributions per compute node. Each of these distributions (essentially a table in and of itself) has dedicated CPU and disk, which is the essence of Massively Parallel Processing in APS. There are 8 internal disks per compute node.
  9. 1 TB drive: 15 TB uncompressed per unit (2 nodes), 60 TB uncompressed per rack (4 units, 8 nodes), 420 TB uncompressed for 7 racks (28 units, 56 nodes). 3 TB drive: 45 TB uncompressed per unit (2 nodes), 180 TB uncompressed per rack (4 units, 8 nodes), 1,260 TB uncompressed for 7 racks (28 units, 56 nodes). [see slide 125] tempdb, the log, and the overhead of formatting the drives, Storage Spaces, etc. have already been subtracted (about 47%). Of the 70 1 TB drives (4 hot spares, 2 for fabric storage, 32 for RAID 1, so 32 drives with unique data for 32 TB per scale unit), that gives you 15 TB of "usable" space on a 1/4 rack; apply a 5:1 compression ratio and you get 75 TB. HP ProLiant DL360p Gen8 server, 256 GB RAM, 1U. Each server has 2 processors (E5-2690 "Sandy Bridge", 2.90 GHz, 20 MB cache) with 8 cores each, so 16 cores per server. Sixteen (16) HP 16 GB (2R x 4) PC3-12800R (DDR3-1333) memory modules. Two (2) internal HP 600 GB 6G SAS 10K 2.5-inch drives. Each server is paired with 1 HP D6000 high-density storage enclosure (70 HDDs (7.2K) of either 1, 2, or 3 TB capacity) connected through an H221 SAS HBA, 5U, 6 Gb/s. I usually use the word "conservative" when I'm talking about a 5:1 ratio; most others in the industry generally use the same number.
  10. Key goal of slide: To convey the major pillars of the Analytics Platform System with key points. To help organizations make a simple and smooth transition to this new world of data, Microsoft introduces the Microsoft Analytics Platform System (APS) — the only no-compromise modern data warehouse solution that brings both Hadoop and RDBMS together in a single, pre-built appliance with tier-one performance, the lowest TCO in the industry, and accessibility for all users through some of the most widely used BI tools in the industry.  Enterprise-ready Big Data: Microsoft APS combines Microsoft's industry-leading RDBMS platform, the Parallel Data Warehouse appliance (PDW), with Microsoft's Hadoop distribution, HDInsight, for non-relational data, to offer an all-in-one Big Data analytics appliance. Tying together and integrating the worlds of relational and Hadoop data is PolyBase, Microsoft's integrated query tool available only in APS. Your modern data warehouse in one turnkey appliance: APS integrates PDW and HDInsight to operate seamlessly together in a single appliance. Integrated querying across all data types using T-SQL: PolyBase allows Hadoop data to be queried using rich-featured T-SQL, while taking advantage of Hadoop processing, without additional Hadoop-based skills or training. Enterprise-ready Hadoop: HDInsight is Microsoft's Hadoop-based distribution with end-user authentication via Active Directory, managed by IT using System Center. Big data insights for any user: native Microsoft BI integration within PolyBase allows everyone access to insights through familiar tools such as SSAS and Excel.
  11. Key goal of slide: Communicate what Big Data is. Slide talk track: ERP, SCM, CRM, and transactional web applications are classic examples of systems processing transactions. Highly structured data in these systems is typically stored in SQL databases. Web 2.0 is about how people and things interact with each other or with your business. Web logs, user clickstreams, social interactions and feeds, and user-generated content are classic places to find interaction data. Big Data is the explosion of data volume and types, inside and outside the business, too large for traditional systems to manage. There are multiple types of data, including personal, organizational, public, and private.  More important, Big Data is changing how the business uses data, from historical analysis to predictive analytics. Enterprises are using data in more progressive and higher-value applications. These uses and applications are changing how data must be stored, managed, analyzed, and accessed in order to provide not just the historical and insight analysis of the current data warehouse, but the predictive analytics and forecasting needed to stay competitive in the current marketplace.
  12. Key goal of slide: Communicate what Hadoop is. Slide talk track: Everyone has heard of Hadoop. But what is it? And do I need it? Apache Hadoop is an open-source solution framework that supports data-intensive distributed applications on large clusters of commodity hardware. Hadoop is composed of a few parts: HDFS — the Hadoop Distributed File System, Hadoop's file system, which stores large files (from gigabytes to terabytes) across multiple machines. MapReduce — a programming model that performs filtering, sorting, and other data retrieval commands as a parallel, distributed algorithm. Other parts of the Hadoop ecosystem include HBase, R, Pig, Hive, Flume, Mahout, Avro, and ZooKeeper, which all perform supplementary functions.
  13. Key goal of slide: Communicate conceptually how companies are managing Big Data in current data warehouse environments. This shows both setting up a side-by-side Hadoop cluster and ETL-ing data into an existing data warehouse. Slide talk track: Many companies have responded to the explosion of Big Data by setting up side-by-side Hadoop ecosystems. However, these companies are learning the limitations of this approach, including the steep learning curve of MapReduce and other Hadoop ecosystem tools, and the cost of installing, maintaining, and tooling side-by-side ecosystems to support two separate query models. Many Hadoop solutions do not integrate into enterprise or other data warehouse systems, creating complexity and cost and slowing time to insights. Some Hadoop solutions feature vendor lock-in, creating long-term obligations. Other companies set up costly extract, transform, and load (ETL) operations to move non-relational data directly into the data warehouse. This requires IT to modify or create new data schemas for all new data, which is also time-consuming and costly. As a result, performance is degraded, and it is often more expensive to integrate new data, build new applications, or access key BI insights.
  14. Key goal of slide: Communicate what HDInsight is. Slide talk track: HDInsight is an enterprise-ready, Hadoop-based distribution from Microsoft that brings a 100% Apache Hadoop solution to the data warehouse. APS gives customers Hadoop with the simplicity of a single appliance, and Microsoft integrates Hadoop data processing directly into the architecture of the appliance for optimum performance. Each HDInsight node has "shared nothing" access to CPU, memory, and storage. HDInsight for APS is the most enterprise-ready Hadoop distribution in the market, offering enterprise-class security, scalability, and manageability. Thanks to a dedicated secure node, HDInsight helps you secure your Hadoop cluster. HDInsight also simplifies management through System Center, and organizations can give multiple users simultaneous access to HDInsight within the appliance via Active Directory.
  15. This diagram illustrates the basic layout of the direct-to-fabric Hadoop region alongside a data warehouse region designed for the APS appliance and Windows Azure. Each region provides a boundary for workload, security, metering, and servicing. HDInsight is a Hadoop region that sits over the fabric of the appliance, alongside the PDW region, for processing. Both regions take advantage of PolyBase as a shared query and processing model, which results in exceptional performance improvements across every node. Based on the Hortonworks 1.0 HDFS, the new HDI (HDInsight) region within APS is a dedicated Hadoop region that sits directly on top of the fabric layer of the appliance to share metered resources with the APS engine and process Hadoop cluster data. In some respects this transforms APS into a concurrent relational and Hadoop engine, resulting in much better performance. An appliance can be configured to support relational queries only (excluding the HDI region), to provide a Hadoop-only node, or to support both relational and Hadoop from a single appliance. In addition, HDInsight enables the processing of Hadoop data in place, without the need for expensive ETL (extract, transform, and load). By taking advantage of Azure Storage Vault blobs, HDInsight can even extend the storage of the traditional data warehouse into the cloud. Technically, adding one or more scale units of HDI to an all-APS rack is "add region", which is supported.  Adding one or more scale units of HDI to a rack that already contains HDI is "add capacity/unit" and is not supported for AU1.
  16. Key goal of slide: PolyBase is available only within the Microsoft Analytics Platform System.
Slide talk track: PolyBase simplifies this by allowing Hadoop data to be queried with the standard Transact-SQL (T-SQL) query language, without the need to learn MapReduce and without the need to move the data into the data warehouse. PolyBase unifies relational and non-relational data at the query level.
Integrated query: PolyBase accepts a standard T-SQL query that joins tables containing a relational source with tables in a Hadoop cluster referencing a non-relational source, then seamlessly returns the results to the user. PolyBase can query Hadoop data in other Hadoop distributions such as Hortonworks or Cloudera.
No difficult learning curve: Standard T-SQL can be used to query Hadoop data. Users are not required to learn MapReduce to execute the query.
Cloud-hybrid scenario options: PolyBase can also query across Windows Azure HDInsight, providing a hybrid cloud solution to the data warehouse. The ability to query all of your company's data with good performance, independent of where it resides and what format it is stored in, is crucial in today's data-centric world of massive, increasing data volume. Today, with AU1, one can query various Hadoop distributions plus data stored in Azure. For example, with one single T-SQL statement a user can query data stored in multiple HDP 2.0 clusters, combine it with data in PDW, and combine it with data stored in Azure. No one else in the industry (as far as I'm aware) can do this in such a simple fashion. By bringing all Microsoft assets together, on-premises and specifically through our Azure play, including various services that will come online in the future, we can clearly distinguish ourselves through a unique and complete end-to-end data management story.
No doubt there are several pieces missing in our "Poly" vision, including support for other data stores, push-down computation for our cloud story, more user-definable language options, better automation and policies, and many more ideas we'd like to pursue in the weeks and months ahead.
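The integrated-query scenario above can be sketched in T-SQL. This is a hypothetical example, not taken from the deck: the data source, file format, table names, locations, and DDL options are all illustrative, and the exact PolyBase syntax varies by APS/PolyBase version.

```sql
-- All names and locations below are hypothetical.

-- Point PolyBase at an external Hadoop cluster.
CREATE EXTERNAL DATA SOURCE HadoopCluster WITH (
    TYPE     = HADOOP,
    LOCATION = 'hdfs://hadoop-head-node:8020'   -- assumed host/port
);

-- Describe how the HDFS files are laid out.
CREATE EXTERNAL FILE FORMAT DelimitedText WITH (
    FORMAT_TYPE    = DELIMITEDTEXT,
    FORMAT_OPTIONS (FIELD_TERMINATOR = ',')
);

-- Expose the HDFS files as a relational table.
CREATE EXTERNAL TABLE dbo.WebClickstream (
    CustomerId INT,
    Url        VARCHAR(1000),
    ClickDate  DATE
) WITH (
    LOCATION    = '/logs/clickstream/',
    DATA_SOURCE = HadoopCluster,
    FILE_FORMAT = DelimitedText
);

-- One standard T-SQL statement joining relational PDW data
-- with non-relational data sitting in Hadoop.
SELECT c.CustomerName, COUNT(*) AS Clicks
FROM   dbo.Customers      AS c   -- relational table in PDW
JOIN   dbo.WebClickstream AS w   -- external table over HDFS
       ON c.CustomerId = w.CustomerId
GROUP BY c.CustomerName;
```

No MapReduce job is written by the user; PolyBase handles the split between the relational engine and the Hadoop cluster behind the scenes.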
  17. HDInsight benefits: cheap, quick to procure.
Key goal of slide: Highlight the four main use cases for PolyBase.
Slide talk track: There are four key scenarios for using PolyBase with the lake of data normally locked up in Hadoop.
PolyBase leverages the APS MPP architecture, along with optimizations like push-down computation, to query data using Transact-SQL faster than other Hadoop technologies such as Hive. More importantly, you can use Transact-SQL join syntax between Hadoop data and PDW data without having to import the data into PDW first.
PolyBase is a great tool for archiving older or unused data in APS to less expensive storage on a Hadoop cluster. When you do need to access the data for historical purposes, you can easily join it back up with your PDW data using Transact-SQL.
There are times when you need to share your PDW data with Hadoop users, and PolyBase makes it easy to copy data to a Hadoop cluster.
Using a simple SELECT INTO statement, PolyBase makes it easy to import valuable Hadoop data into PDW without having to use external ETL processes.
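The import and archive use cases above might look roughly like the following T-SQL. This is a sketch under assumptions: it presumes an external data source (HadoopCluster), file format (DelimitedText), and external table (dbo.WebClickstream) have already been defined, and every name here is hypothetical.

```sql
-- Import: materialize Hadoop data as a local PDW table.
-- (CREATE TABLE AS SELECT is the PDW idiom for this kind of load.)
CREATE TABLE dbo.ClickstreamLocal
WITH (DISTRIBUTION = HASH(CustomerId))   -- spread rows across compute nodes
AS
SELECT CustomerId, Url, ClickDate
FROM   dbo.WebClickstream;               -- external table over HDFS

-- Archive/export: push cold relational rows out to cheaper Hadoop storage,
-- where they remain queryable and joinable via T-SQL.
CREATE EXTERNAL TABLE dbo.OrdersArchive
WITH (
    LOCATION    = '/archive/orders/',    -- assumed HDFS path
    DATA_SOURCE = HadoopCluster,
    FILE_FORMAT = DelimitedText
)
AS
SELECT *
FROM   dbo.Orders
WHERE  OrderDate < '2010-01-01';
```

Either direction is a single statement, which is the point of the slide: no external ETL pipeline is needed to move data between PDW and Hadoop.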
  18. Big Data adds value to the business when it is accessible to BI users with tools that are easy to use and consume for IT and business users alike. While some Hadoop solutions provide BI tools, or require customers to find third-party BI solutions, these often result in a low adoption rate due to learning curves. Surveys from Gartner, The BI Survey, and Intelligent Enterprise have found abysmal BI adoption of current solutions (~8%) due to complaints about the complexity of the tools and the cost of the solution. The BI solution must be provided to users in tools they already know and can consume. APS is the only data warehouse and Hadoop solution with native end-to-end Microsoft BI integration through PolyBase, allowing users to create new insights themselves using tools they already know. Every Microsoft BI client (SSAS, SSRS, Power Pivot, and Power View) has native integration with APS and ubiquitous connectivity across the entire SQL Server ecosystem. With native BI integration, Microsoft is unique in offering an end-to-end Big Data solution where there are no barriers in the journey from acquiring raw data of all types to displaying high-value insights to all users. By providing the customer with an HDInsight region in APS, with PolyBase for querying and joining any type of data in T-SQL, and by democratizing access to data insight through familiar BI tools, Microsoft is prepared to deliver Big Data insights to any user.
  19. Key goal of slide: Convey the major pillars of the Analytics Platform System with key points.
Next-generation performance at scale: APS was built to scale into multi-petabytes, handling both RDBMS data and data stored in Hadoop, to deliver the performance that meets today's near real-time and rapid-insight requirements.
Scale out to accommodate your growing data: APS contains PDW and HDInsight, both of which have a linear scale-out architecture. Start small with a few terabytes and dynamically add capacity for seamless, linear scale-out.
Remove DW bottlenecks with MPP SQL Server: Get the dynamic performance and scale that your modern data warehouse requires while retaining your skills and investment in SQL Server.
Real-time performance with in-memory: Provides up to 100x improvement in query performance and 15x compression via the updateable in-memory columnstore.
Concurrency that supports high adoption: Scales in simultaneous user accessibility. APS has high concurrency, allowing for multiple workloads.
  20. Today, if you are not using an MPP scale-out appliance, your data warehouse is most likely built on the traditional scale-up SMP architecture and organized as row stores. A scale-up solution runs queries sequentially on a shared-everything architecture: everything is processed on a single box that shares memory, disk, I/O operations, and more. To get more scale in a scale-up solution, you need to acquire a more powerful hardware box each time; you cannot simply add more hardware to the existing rack. A scale-up solution also has diminishing returns after a certain scale. A rowstore stores data in traditional tables as rows, with the values comprising one row stored contiguously on a page. Rowstores are often not optimal for the queries issued to a data warehouse, because each query returns the entire row of data, including fields that might not be needed. The combination of scale-up SMP and rowstores is a common limitation of existing warehouses that affects performance.
  21. Key goal of slide: Communicate that the Microsoft Modern Data Warehouse can scale out to petabytes of relational data.
Slide talk track: SQL Server 2012 APS is a scale-out, Massively Parallel Processing (MPP) architecture that represents the most powerful distributed computing and scale; this type of technology powers supercomputers to achieve raw computing horsepower. As more scale is needed, more resources can be added to scale out to the largest data warehousing projects. APS uses a shared-nothing architecture with multiple physical nodes, each running its own instance of SQL Server with dedicated CPU, memory, and storage. As queries go through the system, they are broken up to run simultaneously over each physical node. The benefit is the highest performance at scale through parallel execution: you need only add new resources to continually scale out. This means that if you also have high concurrency and complex queries at scale, APS can handle them with ease, and APS can be optimized for "mixed workload" and "near real-time" data analysis. Enjoy faster data loading at more than two terabytes per hour.
Other benefits of scale-out technologies:
Start small and scale out to petabytes of data
Optimized for "mixed workload" and "near real-time" data analysis
Support for high concurrency
Query while you load
No hardware bottlenecks
No "forklifting" when you want to scale your system
Scale not only for data size but for faster queries
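The shared-nothing distribution described above surfaces directly in PDW's table DDL. A minimal sketch with hypothetical table names; the `DISTRIBUTION` option is how PDW decides whether rows are spread across compute nodes or copied to every node.

```sql
-- Hypothetical schema. Large fact tables are hash-distributed so each
-- compute node owns a slice of the data and scans run in parallel.
CREATE TABLE dbo.FactSales (
    SaleId     BIGINT,
    ProductId  INT,
    SaleAmount MONEY
)
WITH (DISTRIBUTION = HASH(SaleId));

-- Small dimension tables are typically replicated to every compute node,
-- so joins against them need no cross-node data movement.
CREATE TABLE dbo.DimProduct (
    ProductId   INT,
    ProductName VARCHAR(200)
)
WITH (DISTRIBUTION = REPLICATE);
```

A query joining these two tables is decomposed by the control node and executed simultaneously on every compute node against its local slice, which is the parallel execution the slide describes.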
  22. Key goal of slide: Use an interesting story to show how the new modern data warehouse can handle real-time performance with in-memory technologies.
TODO: for parallel query execution, explain the difference from SMP.
Slide talk track: The biggest issue with traditional data warehouses is that data is stored in rows, with the values comprising one row stored contiguously on a page. Rowstores are not optimal for many queries issued to the data warehouse, because the query returns the entire row of data, including fields that might not be needed. By changing the primary storage engine to a new, updateable version of the in-memory columnstore, data is grouped and stored one column at a time. The benefits of doing this are as follows:
Only the columns needed must be read, so less data is read from disk to memory and later moved from memory to processor cache.
Columns are heavily compressed, which reduces the number of bytes that must be read and moved.
Most queries do not touch all columns of the table, so many columns are never brought into memory. This, combined with excellent compression, improves buffer pool usage, which reduces total I/O.
The result is massive compression (sometimes as much as 10x) as well as massive performance gains (as much as 100x). Using columnstore also leverages your existing hardware instead of requiring you to purchase a new appliance.
New in SQL Server 2012 APS and SQL Server 2014: updateable and clustered columnstore. SQL Server 2012 APS and SQL Server 2014 use the new updateable columnstore. Updates and direct bulk load are fully supported, which simplifies and accelerates data loading and enables real-time data warehousing and trickle loading. Using columnstore can also save roughly 70% of overall storage space if you choose to eliminate the rowstore copy of the data entirely.
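As a sketch of the columnstore approach described above, a fact table can be created with a clustered columnstore index so the column-wise, compressed format is the table's primary storage. Table and column names are illustrative, not from the deck.

```sql
-- Hypothetical fact table stored as an updateable clustered columnstore,
-- combined with PDW's hash distribution across compute nodes.
CREATE TABLE dbo.FactPageViews (
    ViewDate   DATE,
    CustomerId INT,
    PageId     INT,
    Duration   INT
)
WITH (
    DISTRIBUTION = HASH(CustomerId),
    CLUSTERED COLUMNSTORE INDEX   -- column-wise, heavily compressed storage
);

-- A query touching two of the four columns reads only those column
-- segments from disk, rather than whole rows:
SELECT ViewDate, COUNT(*) AS Views
FROM   dbo.FactPageViews
GROUP BY ViewDate;
```

Because the columnstore is the clustered (primary) structure, no separate rowstore copy of the data is kept, which is where the storage savings mentioned above come from.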
  23. Key goal of slide: Explain the limitations of the serial-processing SMP architecture compared to high-concurrency MPP.
High-performance ad hoc analytic queries
Pull insights simultaneously throughout the day
Run multiple types of queries simultaneously
Run multiple types of workloads together with no tuning required
High concurrency means high availability, which means higher adoption
Slide talk track: With the explosion of data and the growth of end users demanding real-time insights, data warehouses are growing not only in resources but also in the number of users frequently accessing them. A modern data warehouse needs to scale out to return query results quickly, but it also needs to run mixed workloads all at the same time. Mixed workloads refer to concurrency: multiple types of queries are submitted, along with data loads and ELT processing. Under mixed workload scenarios, which organizations are certain to face, APS runs concurrent queries with little or no tuning. Organizations no longer have to worry about the types of workloads being run at any given time, and Microsoft APS can handle many users pulling insights simultaneously throughout the day.
  24. Key goal of slide: Convey the major pillars of the Analytics Platform System with key points.
Optimal architecture: More than just a converged system, APS has reshaped the very hardware specifications required, through software innovations, to deliver optimal value. Through features delivered in Windows Server 2012, customers get exceptional value:
APS provides the industry's lowest DW price/TB: Lower cost while maintaining performance by replacing SAN with economical Windows Server 2012 Storage Spaces. Save up to 70% of APS storage with up to 15x compression via the updateable in-memory columnstore.
Value through a single-appliance solution: Reduce the hardware footprint by having PDW and HDInsight within a single appliance, and remove the need for costly integration efforts.
Value through flexible hardware options: Avoid hardware lock-in through flexible hardware options from HP, Dell, and Quanta.
  25. Competitors: EMC Greenplum, Teradata, Oracle Exadata, HP Vertica, and IBM Netezza.
Key goal of slide: Use an interesting story to show how the new modern data warehouse can handle real-time performance with in-memory technologies.
Slide talk track: Value through software innovation and hardware commoditization. More than just a converged system, APS has reshaped the very hardware specifications required, through software innovations, to deliver optimal value. Through features delivered in Windows Server 2012, customers get exceptional value:
Through Storage Spaces, APS has the performance, reliability, and scale for storage built into the software, allowing it to replace the SAN with a more economical high-density disk option. This results in large capacity at low cost with no reduction in performance.
Hyper-V virtualization and hardware design minimize the hardware footprint and cost of the appliance, enabling high availability as simply as possible.
Microsoft lowers cost by reducing the hardware footprint through virtualization, providing Storage Spaces to replace expensive SAN storage, and compressing data up to 15x to lower storage usage. These features give APS the lowest relational data warehouse price/terabyte of any vendor by a significant margin (~2x lower than market). The overall market's comparable price/terabyte ranges from $8-13K/TB. For example, Oracle announced Exadata in a 1/8th-rack form factor that costs $200K. However, this is only the hardware cost and does not include software prices, which can cost significantly more, from hundreds of thousands to a million dollars.
Even accounting for Oracle's 10x compression, APS has a price/terabyte that is about half Oracle's list price for their normal drive sizes (non-high capacity).
IBM PureData pricing: < $500,000 for a quarter rack (8TB uncompressed) http://www.theregister.co.uk/2012/10/10/ibm_puredata_database_appliances/ at 4x compression (= $12-15K/TB)
Oracle Exadata pricing: HW pricing ($1.1M) http://www.oracle.com/us/corporate/pricing/exadata-pricelist-070598.pdf; SW pricing ($7.2M) http://www.oracle.com/us/corporate/pricing/technology-price-list-070617.pdf at 100TB uncompressed and 10x compression (= $8K/TB)
EMC Greenplum pricing: $1,000,000 for a half rack (18TB uncompressed) http://www.informationweek.com/software/information-management/emc-intros-backup-savvy-greenplum-applia/227701321 at 4x compression (= $13.8K/TB)
This pricing analysis was done on the last-known publicly accessible information available and represents the current view of Microsoft Corporation as of the date of this presentation. Because companies respond to changing market conditions, it should not be interpreted as a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided outside of the sources cited or after the date of this presentation.
Source: Value Prism Consulting, "Microsoft's SQL Server Parallel Data Warehouse Provides High Performance and Great Value"; website: http://www.valueprism.com/resources/resources/ResourceDetails.aspx?ID=100
  26. Windows Server 2012 and Windows Azure Virtual Machines offer full virtualization services for both on-premises and on-demand installations.
General details:
All hosts run Windows Server 2012 Standard and Windows Azure Virtual Machines
Fabric and workload run in Hyper-V virtual machines
The fabric virtual machine, MAD01, and CTL share one server, lowering overhead costs, especially for small topologies
The APS agent runs on all hosts and all virtual machines and collects appliance health data on fabric and workload
DWConfig and the Admin Console continue to exist, with minor extensions to expose host-level information
Windows Storage Spaces and Azure Storage blobs enable the use of lower-cost DAS (JBODs)
APS workload details:
SQL Server 2012 Enterprise Edition (APS build)
Control node and compute nodes for the APS workload
Storage details:
More files per filegroup
Uses a larger number of spindles in parallel
  27. Key goal of slide: APS was built to scale to handle the highest data requirements and the newest data types stored in Hadoop, and to deliver the performance that meets today's near real-time requirements.
Slide talk track: A modern data warehouse is progressive, meeting broad needs and requirements:
Hadoop integrates and operates seamlessly with your relational data warehouse
Data is easily queried by SQL users without additional skills or training
Enterprise-ready, meaning it is secure and easily managed by IT
Insights are accessible to everyone
The Microsoft Analytics Platform System (APS) is the only no-compromise modern data warehouse solution that brings both Hadoop and an RDBMS into a single, pre-built appliance with tier-one performance, the lowest TCO in the industry, and accessibility for all users through some of the most widely used BI tools in the industry. Microsoft APS combines Microsoft's industry-leading RDBMS platform, the Analytics Platform System appliance (APS), with Microsoft's Hadoop distribution, HDInsight, for non-relational data to offer an all-in-one Big Data analytics appliance. Tying together and integrating the worlds of relational and non-relational data is PolyBase, Microsoft's integrated query tool, available only in APS.
  28. Data capacity: varies from 15 terabytes at the smallest to 6 petabytes at the largest with 5:1 compression (1.2 petabytes uncompressed), from 1/4 rack up to 7 racks.
Data loading speed: ideally 175GB/hour per node (8 nodes would give 1 TB/hour); 250GB/hr has been seen; 10-20x faster.
Data compression: 3x-15x, with 5x a conservative number. Compression is unique because of the distribution across compute nodes.
Query performance: 10x-100x, with a reasonably linear increase as more racks are added.