The Microsoft Analytics Platform System (APS) is a turnkey appliance that provides a modern data warehouse with the ability to handle both relational and non-relational data. It uses a massively parallel processing (MPP) architecture with multiple CPUs running queries in parallel. The APS includes an integrated Hadoop distribution called HDInsight that allows users to query Hadoop data using T-SQL with PolyBase. This provides a single query interface and allows users to leverage existing SQL skills. The APS appliance is pre-configured with software and hardware optimized to deliver high performance at scale for data warehousing workloads.
6. Are you using, or going to use, "Big Data" and/or "Hadoop"?
• No or limited access to detailed data; can only surface reports and cannot ask ad-hoc questions.
• Slow data loading performance cannot keep up with the need for data from transactional systems for intraday reporting.
• MOLAP cube processing and data refresh take too long.
• Slow query performance with need for constant tuning, especially with SAN storage.
• High cost of SAN storage chargeback.
7. Roadblocks to evolving to a modern data warehouse
Each solution choice and the issue with that solution:
• Keep legacy investment: limited scalability and ability to handle new data types
• Buy a new tier-one hardware appliance: high acquisition/migration costs and no Hadoop
• Acquire a big data solution (Hadoop): significant training and still siloed
• Acquire a business intelligence solution: complex with low adoption
8. Introducing the Microsoft Analytics Platform System
Your turnkey modern data warehouse appliance
• Relational and non-relational data in a single appliance
• Or, integrate relational data with non-relational data in an external Hadoop cluster on-premises, or data stored in the cloud (hot, warm, cold)
• Enterprise-ready Hadoop
• Integrated querying across Hadoop and APS using T-SQL (PolyBase)
• Direct integration with Microsoft BI tools such as Power BI
• Near real-time performance with in-memory
• Scale out to accommodate your growing data or to increase performance (2 nodes to 56 nodes)
• Remove SMP DW bottlenecks with MPP SQL Server
• No rip and replace when more performance is needed
• No performance tuning required
• Concurrency that fuels rapid adoption
• Industry's lowest DW price/TB
• Value through a single appliance solution
• Value with flexible hardware options using commodity hardware
• Free up space on the SAN (cost averages $10K per TB)
10. Hardware and software engineered together
The ease of an appliance:
• Co-engineered with HP, Dell, and Quanta best practices
• Leading performance with commodity hardware
• Pre-configured, built, and tuned software and hardware
• Integrated support plan with a single Microsoft contact
(Diagram: PDW, HDInsight, and PolyBase in one appliance)
11. APS History
• DATAllegro started in 2003
• Microsoft acquired DATAllegro in September 2008
• PDW released in December 2010 (version 1)
• Version 2 made available in March 2013 (PolyBase introduced)
• AU1 released in April 2014; renamed from Parallel Data Warehouse (PDW) to Analytics Platform System (APS). It still includes the PDW region as well as a new HDInsight/Hadoop region
• AU2 released in July 2014
• AU3 released in October 2014
There will be AU updates every 3-4 months.
NOTE: This is a data warehouse solution and not an OLTP (online transaction processing) solution.
Case studies: Go to https://customers.microsoft.com and search the keyword box for "parallel data warehouse" (the old name) and then "analytics platform system" (the new name).
12. Parallelism
MPP (Massively Parallel Processing)
• Uses many separate CPUs running in parallel to execute a single program
• Shared nothing: each CPU has its own memory and disk (scale-out)
• Segments communicate using a high-speed network between nodes
SMP (Symmetric Multiprocessing)
• Multiple CPUs used to complete individual processes simultaneously
• All CPUs share the same memory, disks, and network controllers (scale-up)
• All SQL Server implementations up until now have been SMP
• Mostly, the solution is housed on a shared SAN
13. APS Logical Architecture (overview)
(Diagram: one control node and multiple compute nodes, each running SQL Server and DMS over balanced storage)
Compute node – the "worker bee" of APS
• Runs SQL Server 2014 (APS build)
• Contains a "slice" of each database
• CPU is saturated by storage
Control node – the "brains" of the APS
• Also runs SQL Server 2014 (APS build)
• Holds a "shell" copy of each database (metadata, statistics, etc.)
• The "public face" of the appliance
Data Movement Service (DMS)
• Part of the "secret sauce" of APS
• Moves data around as needed
• Enables parallel operations among the compute nodes (queries, loads, etc.)
14. APS Logical Architecture (overview)
(Diagram: the control node and four compute nodes, each running SQL Server and DMS over balanced storage)
1) User connects to the appliance (control node) and submits a query
2) The control node query processor determines the best *parallel* query plan
3) DMS distributes sub-queries to each compute node
4) Each compute node executes its query on its subset of the data
5) Each compute node returns a subset of the response to the control node
6) If necessary, the control node does any final aggregation/computation
7) The control node returns results to the user
Queries run in parallel on subsets of the data through separate pipes, effectively making the overall pipe larger.
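The flow above needs no special syntax; the user submits ordinary T-SQL and the appliance handles the parallelism. A sketch (table and column names are illustrative, borrowed from the star-schema example later in the deck):

```sql
-- Submitted once, at the control node, like any SQL Server query.
SELECT s.StoreName, SUM(f.DollarsSold) AS TotalSales
FROM dbo.SalesFact AS f
JOIN dbo.StoreDim  AS s ON f.StoreDimID = s.StoreDimID
GROUP BY s.StoreName;
-- The control node builds a parallel plan, DMS hands a sub-query to each
-- compute node, each node aggregates its own slice of SalesFact, and the
-- control node merges the partial aggregates into one result set.
```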
15. APS Data Layout Options
(Diagram: a star schema laid out across four compute nodes; each node holds a copy of every dimension table and a slice of the fact table)
Star schema tables:
• Time Dim: Date Dim ID, Calendar Year, Calendar Qtr, Calendar Mo, Calendar Day
• Store Dim: Store Dim ID, Store Name, Store Mgr, Store Size
• Product Dim: Prod Dim ID, Prod Category, Prod Sub Cat, Prod Desc
• Customer Dim: Cust Dim ID, Cust Name, Cust Addr, Cust Phone, Cust Email
• Sales Fact: Date Dim ID, Store Dim ID, Prod Dim ID, Cust Dim ID, Qty Sold, Dollars Sold
Replicated: the table is copied to each compute node (the dimension tables)
Distributed: the table is spread across compute nodes based on a "hash" (the SalesFact table)
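In APS T-SQL, the layout choice is declared when the table is created, in PDW's CREATE TABLE WITH clause. A sketch of both options for the star schema above (column types are assumptions):

```sql
-- Replicated: a full copy of the small dimension lands on every compute
-- node, so joins to it never require data movement.
CREATE TABLE dbo.StoreDim
(
    StoreDimID int          NOT NULL,
    StoreName  varchar(100) NOT NULL,
    StoreMgr   varchar(100),
    StoreSize  int
)
WITH (DISTRIBUTION = REPLICATE);

-- Distributed: the large fact table is spread across compute nodes by a
-- hash of one column; pick a high-cardinality column to avoid skew.
CREATE TABLE dbo.SalesFact
(
    DateDimID   int   NOT NULL,
    StoreDimID  int   NOT NULL,
    ProdDimID   int   NOT NULL,
    CustDimID   int   NOT NULL,
    QtySold     int,
    DollarsSold money
)
WITH (DISTRIBUTION = HASH (CustDimID));
```

The speaker notes make the same point about the distribution key: you want high cardinality so the hash spreads rows evenly.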
17. APS – Balanced across servers and within
• Largest table: 600,000,000,000 rows
• Randomly distributed across 40 compute nodes (5 racks): 15,000,000,000 rows per node
• In each server, randomly distributed to 8 tables (so 320 total tables): 1,875,000,000 rows per table
• Each partition (2 years of data partitioned by week, benefiting queries by date): 18,028,846 rows
As an end user or DBA you think about 1 table: LineItem.
"SELECT * FROM LineItem" is split into 320 queries running in parallel against 320 tables of 1.875 billion rows each.
"SELECT * FROM LineItem WHERE OrderDate = '1/1/2014'" is 320 queries, each against an 18-million-row partition.
You don't care, or need to know, that there are actually 320 tables representing your 1 logical table.
A CCI (clustered columnstore index) can add further performance via segment elimination.
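The layout described above (a hash-distributed LineItem table, partitioned by week, with a clustered columnstore index) might be declared like this in APS T-SQL (a sketch; column names and partition boundary values are illustrative):

```sql
-- One logical table; behind the scenes the appliance creates a
-- distribution per compute node slot and a partition per week.
CREATE TABLE dbo.LineItem
(
    OrderKey    bigint NOT NULL,
    OrderDate   date   NOT NULL,
    Quantity    int,
    ExtendedAmt money
)
WITH
(
    DISTRIBUTION = HASH (OrderKey),
    CLUSTERED COLUMNSTORE INDEX,
    PARTITION (OrderDate RANGE RIGHT
               FOR VALUES ('2013-01-07', '2013-01-14' /* ...one boundary per week... */))
);

-- The date predicate lets each sub-query touch only one small partition,
-- and the columnstore adds segment elimination on top.
SELECT * FROM dbo.LineItem WHERE OrderDate = '2014-01-01';
```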
22. What is Hadoop?
Distributed, scalable system on commodity hardware, composed of a few parts:
• HDFS – distributed file system
• MapReduce – programming model
• Other tools: Hive, Pig, Sqoop, HCatalog, HBase, Flume, Mahout, YARN, Tez, Spark, Stinger, Oozie, ZooKeeper, Storm
Main players are Hortonworks, Cloudera, and MapR.
WARNING: Hadoop, while ideal for processing huge volumes of data, is inadequate for analyzing that data in real time (companies do batch analytics instead).
(Diagram: Hadoop core services – operational services such as Ambari, Oozie, and Falcon; data services such as Hive & HCatalog, Pig, HBase, Sqoop, Flume, NFS, WebHDFS, and load & extract; YARN and MapReduce running on HDFS across a cluster of compute-and-storage nodes)
Hadoop clusters provide scale-out storage and distributed data processing on commodity hardware.
23. Complex query and analysis with big data today
Steep learning curve, slow and inefficient:
• Move HDFS data from "new" data sources into the warehouse with ETL before analysis with T-SQL, or
• Build, integrate, manage, maintain, and support a separate Hadoop ecosystem and learn new skills.
24. APS delivers enterprise-ready Hadoop with HDInsight
Manageable, secured, and highly available Hadoop integrated into the appliance:
• High performance tuned within the appliance
• End-user authentication with Active Directory
• Accessible insights for everyone with Microsoft BI tools
• Managed and monitored using System Center
• 100% Apache Hadoop
• Leverage your existing T-SQL skills
(Diagram: SQL Server Parallel Data Warehouse and Microsoft HDInsight connected by PolyBase)
Additional features over a separate Hadoop cluster, plus still only one support contact!
25. APS appliance overview
The appliance contains a Parallel Data Warehouse region and an HDInsight region on top of shared fabric and hardware.
A region is a logical container within an appliance. Each workload contains the following boundaries:
• Security
• Metering
• Servicing
26. Query Hadoop data with T-SQL using PolyBase
Bringing the worlds of big data and the data warehouse together for users and IT:
• Provides a single T-SQL query model ("semantic layer") for APS and Hadoop with the rich features of T-SQL, including joins without ETL
• Uses the power of MPP to enhance query execution performance
• Supports Windows Azure HDInsight to enable new hybrid cloud scenarios
• Provides the ability to query non-Microsoft Hadoop distributions, such as Hortonworks and Cloudera
• Use your existing SQL skillset, with no IT intervention
(Diagram: a single SELECT against SQL Server Parallel Data Warehouse returns one result set, with PolyBase reaching Microsoft HDInsight (HDP 2.0), Hortonworks HDP 2.2 (Windows, Linux), Cloudera CDH 5.1 (Linux), and Windows Azure HDInsight (HDP 2.2, WASB); others (SQL Server, DB2, Oracle)?)
A true federated query engine.
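A sketch of the PolyBase objects involved, using the AU-era T-SQL syntax (cluster address, HDFS path, and columns are assumptions):

```sql
-- Point the appliance at an external Hadoop cluster...
CREATE EXTERNAL DATA SOURCE MyHadoop
WITH (TYPE = HADOOP, LOCATION = 'hdfs://10.0.0.1:8020');

-- ...describe the file layout...
CREATE EXTERNAL FILE FORMAT PipeDelimited
WITH (FORMAT_TYPE = DELIMITEDTEXT,
      FORMAT_OPTIONS (FIELD_TERMINATOR = '|'));

-- ...and expose an HDFS directory as a table.
CREATE EXTERNAL TABLE dbo.ClickStream
(
    CustDimID int,
    Url       varchar(500),
    EventDate date
)
WITH (LOCATION = '/logs/clickstream/',
      DATA_SOURCE = MyHadoop,
      FILE_FORMAT = PipeDelimited);

-- The promised join without ETL: relational and Hadoop data in one query.
SELECT c.CustName, COUNT(*) AS PageViews
FROM dbo.CustomerDim AS c
JOIN dbo.ClickStream AS h ON c.CustDimID = h.CustDimID
GROUP BY c.CustName;
```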
27. Use cases where PolyBase simplifies using Hadoop data
• Bringing islands of Hadoop data together
• High-performance queries against Hadoop data (predicate pushdown)
• Archiving data warehouse data to Hadoop (move; Hadoop as cold storage)
• Exporting relational data to Hadoop (copy; Hadoop as backup/DR, analysis, cloud use)
• Importing Hadoop data into the data warehouse (copy; Hadoop as staging area, sandbox, data lake)
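The export/archive and import cases map to two CTAS-style statements in APS T-SQL. A sketch, assuming a PolyBase external data source (MyHadoop), file format (PipeDelimited), and external table (dbo.ClickStream) have already been created; all names are illustrative:

```sql
-- Export/archive: write warehouse rows out to HDFS as an external table.
CREATE EXTERNAL TABLE dbo.SalesFactArchive
WITH (LOCATION = '/archive/sales/',
      DATA_SOURCE = MyHadoop,
      FILE_FORMAT = PipeDelimited)
AS SELECT * FROM dbo.SalesFact WHERE DateDimID < 20120101;

-- Import: materialize Hadoop data as a distributed relational table.
CREATE TABLE dbo.ClickStreamStaged
WITH (DISTRIBUTION = HASH (CustDimID),
      CLUSTERED COLUMNSTORE INDEX)
AS SELECT * FROM dbo.ClickStream;
```

For the archive (move) case, a DELETE of the exported rows from the warehouse table would follow the export, since CETAS itself only copies.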
28. Big data insights for anyone
Native Microsoft BI integration to create new insights with familiar tools:
• Everyone else using Microsoft BI tools: tools like Power BI minimize IT intervention for discovering data, leveraging the high adoption of Excel, Power View, Power Pivot, and SSAS
• Power users: T-SQL for DBAs and power users to join relational and Hadoop data
• Data scientists: Hadoop tools like MapReduce, Hive, and Pig
30. Scale-out technologies in the Analytics Platform System
• Massively Parallel Processing (MPP) parallelizes queries (speed-driven, not just capacity-driven)
• Multiple nodes with dedicated CPU, memory, and storage ("shared nothing")
• Incrementally add hardware for near-linear scale to multiple petabytes (no need to delete older data or stage it)
• Handles query complexity and concurrency at scale
• No "forklift" of the prior warehouse to increase capacity
• Start small with a warehouse of a few terabytes
• Mixed workload support: query while you load (250 GB/hour per node), with no need for a maintenance window
(Diagram: scale from 0 TB to 6 PB by adding PDW or HDInsight nodes)
31. Updateable clustered columnstore vs. a table with customary indexing
• Store data in columnar format for massive compression
• Load data into or out of memory for next-generation performance
• Updateable and clustered for real-time trickle loading
• No secondary indexes required
Up to 100x faster queries and up to 15x more compression.
(Diagram: columnstore index representation; parallel query execution from query to results)
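In APS, the columnstore is declared when a table is created, and an existing rowstore table can be rebuilt as an updateable clustered columnstore with CTAS. A sketch (table and column names are illustrative):

```sql
-- Rebuild a rowstore fact table as an updateable clustered columnstore.
CREATE TABLE dbo.SalesFact_CCI
WITH (DISTRIBUTION = HASH (CustDimID),
      CLUSTERED COLUMNSTORE INDEX)
AS SELECT * FROM dbo.SalesFact;

-- Swap names once the copy is verified (PDW's rename syntax).
RENAME OBJECT dbo.SalesFact     TO SalesFact_old;
RENAME OBJECT dbo.SalesFact_CCI TO SalesFact;
```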
32. Investment firm before/after results – HP SMP vs. APS
• 21x improvement loading data (7:30 minutes vs. 21 seconds)
• 62x improvement staging to landing (30 minutes vs. 29 seconds)
• 17x, 166x, 169x query performance improvement (1:05 hours vs. 23 seconds)
• 46x improvement creating a datamart (70 minutes vs. 1:31 minutes)
• 1.1 TB/hr loading, 8.8x compression on 2 billion rows (472 GB to 53 GB)
• Microsoft BI tools work unchanged
33. Concurrency that fuels rapid adoption
Great performance with mixed workloads.
(Diagram: ERP, CRM, and LOB apps feed the Analytics Platform System (PDW, HDInsight, PolyBase) via ETL/ELT with SSIS, DQS, and MDS, or via ETL/ELT with DWLoader; Hadoop/big data connects through PolyBase. APS serves intra-day, near real-time, fast ad hoc queries via columnstore; SQL Server SMP spokes for reporting and cubes are fed via CRTAS and "link tables"; BI tools connect in real time via ROLAP/MOLAP, DirectQuery, and SNAC.)
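The CRTAS ("CREATE REMOTE TABLE AS SELECT") path mentioned above pushes a result set from the appliance out to an SMP SQL Server spoke for reporting and cubes. A sketch (server name, connection string, and table names are placeholders):

```sql
-- Run on the APS appliance: materialize an aggregate on a spoke server.
CREATE REMOTE TABLE MartDB.dbo.DailyStoreSales
AT ('Data Source = SpokeSqlServer, 1433; User ID = loader; Password = ********;')
AS
SELECT DateDimID, StoreDimID, SUM(DollarsSold) AS DollarsSold
FROM dbo.SalesFact
GROUP BY DateDimID, StoreDimID;
```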
34. Example overall data flow and architecture
(Diagram: event and data producers such as web logs, IoT and mobile devices, and social data are ingested through Event Hubs; Stream Analytics transforms the streams; Azure Data Factory moves data among Azure Blob Storage, Azure SQL DB, HDInsight, and the Analytics Platform System (DW/long-term storage); Azure Machine Learning adds predictive analytics such as fraud detection; Power BI, web dashboards, and mobile devices present the results for decisions.)
36. APS provides the industry's lowest DW appliance price/TB
Reshaped hardware specs through software innovation:
• Significantly lower price per TB than the closest competitor
• Lower storage costs with Windows Server 2012 Storage Spaces
• Small cost gap between multiple clustered HP DL980s with SAN and an APS 1/4 rack
(Chart: TCO per TB, uncompressed, in thousands of dollars, for leading vendors as of Sept 2014: Oracle, Pivotal, IBM, Teradata, Microsoft; scale $0 to $140)
37. Virtualized architecture overview
(Diagram: a base unit of four hosts connected by InfiniBand and Ethernet, with direct-attached SAS to economical disk storage; Host 1 runs the CTL, MAD, AD, and VMM virtual machines, and the other hosts run the Compute 1 and Compute 2 VMs)
Software details:
• APS engine and DMS Manager
• SQL Server 2012 Enterprise Edition (APS build); SQL Server 2014 as of AU3
• All hosts run Windows Server 2012 Standard (2012 R2 as of AU3)
• Fabric and workload run in Hyper-V virtual machines
• The fabric virtual machine, management server (MAD01), and control server (CTL) share one server
• An APS agent runs on all hosts and all virtual machines
• DWConfig and the Admin Console
• Windows Storage Spaces and Azure Storage blobs
• Does not require expertise in Hyper-V or Windows
38. APS High Availability
(Diagram: a control host, compute hosts, and a spare failover host on redundant InfiniBand and Ethernet networks; if a host fails, its virtual machines (FAB, AD, VMM, MAD, CTL, or a compute VM) restart on the failover host)
• No single point of failure
• No need for SQL Server clustering
39. Less DBA Maintenance/Monitoring
• No index creation
• No deleting/archiving data to save space
• Management simplicity (System Center, Admin console, DMVs)
• No blocking
• No logs
• No query hints
• No wait states
• No IO tuning
• No query optimization/tuning
• No index reorgs/rebuilds
• No partitioning
• No managing filegroups
• No shrinking/expanding databases
• No managing physical servers
• No patching servers and software
RESULT: DBAs spend more of their time as architects and not babysitters!
40. Analytics Platform System: the no-compromise modern data warehouse solution
Microsoft's turnkey modern data warehouse appliance
• Improved query performance
• Faster data loading
• Improved concurrency
• Less DBA maintenance
• Limited training needed
• Use familiar BI tools
• Ease of appliance deployment
• Mixed workload support
• Improved data compression
• Scalability
• High availability
• PolyBase
• Integration with cloud-born data
• HDInsight/Hadoop integration
• Data warehouse consolidation
• Easy support model
Summary of Benefits
(On the original slide, bold items mark benefits of APS over upgrading to SQL Server 2014; with APS there is no worry about future hardware roadblocks.)
42. Enterprise-ready big data, cloud enabled
• Improved PolyBase support: Cloudera 5.1 support, partial aggregate pushdowns
• Expanding big data capacity: grow an HDInsight region on an appliance with an existing region
Next-gen performance, engineered for optimal value
• 1.5x data return rate for SELECT * queries
• Streaming large data sets for external apps (e.g., SSAS, SAS, R)
T-SQL compatibility
• Scalar UDFs (CREATE FUNCTION)
• SQL Server SMP to APS (SQL Server MPP) migration utility
• Bulk load/BCP through SQL Server command-line tools
Appliance hardware
• OEM hardware refresh (HP Gen9): HP ProLiant DL360 Gen9 server with 2x Intel Haswell processors and 256 GB (16x16 GB) 2133 MHz memory
• HP 5900 series switches (HA improvements)
Symmetry between DW on-premises and Azure
Editor's Notes
Key goal of slide: To convey what every IT person knows: The data warehouse and what’s it for. Then we set-up the Gartner quote to say that there is a tipping point. End the slide with a question: Why is it at a tipping point?
Slide talk track:
What is the “traditional” data warehouse?
IT professionals know this well. A data warehouse, or enterprise data warehouse, is a database designed specifically for data analysis. It is the single source of truth, the central repository for all data in the company. This means disparate data coming from your transactional systems and your ERP, CRM, or line-of-business applications is all extracted, transformed, cleansed, and put into the warehouse. It was built so that the people accessing the warehouse using BI tools access data that has been provisioned by IT and represents accurate, company-sanctioned data.
However, this traditional data warehouse is reaching an inflection point. Gartner, in its analysis of the state of data warehousing, noted that it is reaching the most significant tipping point since its inception. The question is why? What is going on?
Key goal of slide: To convey that the traditional data warehouse is going to break in one of four different ways. These ways should also not be a surprise to the IT professionals. At the end of the slide, IT should be asking, what can I do to prevent my warehouse from breaking?
Slide talk track:
There are many reasons why data warehouses are at a tipping point where something needs to change.
The first trend that will break my traditional data warehouse is data growth. Data volumes are expected to grow 10X over the next five years and traditional data warehouses cannot keep up with this explosion of data.
In addition to growing data, end users expect to get query results back faster, in near real time. End users are no longer apt to wait minutes to hours for their results, which is something traditional data warehouses cannot keep up with. They also want real-time data, not dated data pulled in during a maintenance window each night.
The third trend is new types of data captured that are “non-relational.” 85% of data growth is coming from “non-relational” data in the form of things like web logs, sensor data, social sentiment and devices. You’ve probably heard the term “Big Data” and “Hadoop” quite a bit. This is where these technologies come into play. More on that later….
The final trend that is appearing is cloud-born data. This is data that might be coming from some of IT's infrastructure that they are starting to host in the cloud (e.g., CRM, ERP) or data not stored by any type of corporate-owned system. How do you incorporate both on-premises and cloud data as part of your data warehouse? This is the last trend that is breaking the traditional data warehouse.
Key goal of slide: To convey that the modern data warehouse is something that the traditional data warehouse must evolve to. To have IT agree that their warehouses need to take advantage of these new technologies (specifically focusing on the middle and bottom layer).
Slide talk track:
To encompass these four trends, we need to evolve our traditional data warehouse to ensure that it does not break. It needs to become the “modern data warehouse.” What is the “modern data warehouse?” This is the new warehouse that is able to excel with these new trends and can be your warehouse now and into the future.
The modern data warehouse has the ability to:
Handle all types of data. Whether it be your structured, relational data sources or your non-relational data sources, the Modern data warehouse will incorporate Hadoop. It can handle real-time data by using complex event processor technologies.
Provide a way to enrich your data with Extract, Transform Load (ETL) capabilities as well as Master Data Management (MDM) and data quality
Provide a way for any BI tool or query mechanism to interface with all these different types of data with a single query model that leverages a single query language that users already know (example: SQL).
Questions drive BI, Analytics drive questions
Top row: the solution choices; bottom row: the problem with each choice.
Key goal of slide: To convey the limitations of current modern data warehouse options in the market.
Slide talk track:
Organizations are facing the challenge of now turning to two platforms for managing their data—relational database management systems (RDBMS) for traditional data and Apache Hadoop, the most widely used open source Big Data platform for large, non-relational data.
Brand-new tier-one appliances are expensive. Major vendors offer tier-one RDBMS appliances. However, many of these come with a high price tag, averaging millions of dollars, and in-company politics may result in long struggles to approve and implement them. Further, most of these appliances focus on point solutions instead of general-purpose ones and do not include a Hadoop solution, requiring a separate, additional appliance and ecosystem.
Hadoop solutions are complex. Vendors can provide a Hadoop solution to you as their own distribution of Hadoop or as an appliance that comes pre-installed with Hadoop. The problem is that the Hadoop ecosystem requires significant training investment, and a major effort is needed to integrate it. There is a steep learning curve and ongoing operational cost when your IT department needs to re-orient itself around HDFS, MapReduce, Hive, and HBase rather than T-SQL and a standard RDBMS design. The result is often increased cost at a time when IT is expected to streamline.
BI tools are unfamiliar. Surveys from Gartner, The BI Survey, and Intelligent Enterprise have found abysmal BI adoption of current solutions (~8%) due to complaints of the complexity of the tools and the cost of the solution. Users want tools they already know and can consume, but no vendor can deliver on all the solutions you need at a reasonable cost or in a natively-integrated manner.
Troubleshooting, support and maintenance. Keeping up with configuration changes, support and maintenance with troubleshooting is not trivial.
Today's world of data is changing rapidly, and organizations need a modern data warehouse to adapt successfully to these changes. However, companies want the smoothest path to this transformation: a path where costs, downtime, and training are minimal, and where performance and accessibility to data insights are vastly improved.
Key goal of slide: To convey that the major pillars of the Analytics Platform System with key points.
To help organizations make a simple, smooth, and seamless transition to this new world of data, Microsoft introduces the Microsoft Analytics Platform System (APS): the only no-compromise modern data warehouse solution that brings both Hadoop and the RDBMS into a single, pre-built appliance with tier-one performance, the lowest TCO in the industry, and accessibility for all their users through some of the most widely used BI tools in the industry.
Enterprise-ready Big Data: Microsoft APS combines Microsoft’s industry leading RDBMS platform, the Parallel Data Warehouse Appliance (PDW), with Microsoft’s Hadoop Distribution, HDInsight, for non-relational data to offer an all-in Big Data Analytics appliance.
Tying together and integrating the worlds of relational and Hadoop data is PolyBase, Microsoft’s integrated query tool available only in APS.
Your Modern Data Warehouse in One Turnkey Appliance
APS integrates PDW and HDInsight to operate seamlessly together in a single appliance
Integrated Querying across All Data Types Using T-SQL
PolyBase allows Hadoop data to be queried using rich-featured T-SQL, while taking advantage of Hadoop processing, without additional Hadoop-based skills or training.
Enterprise-Ready Hadoop
HDInsight is Microsoft’s Hadoop-based distribution with end-user authentication via Active Directory and managed by IT using System Center
Big Data Insights to Any User
Native Microsoft BI integration within PolyBase allows everyone access to insights through familiar tools such as SSAS and Excel
Next-generation performance at scale: APS was built to scale into multi-petabytes, handling both the RDBMS and the data stored in Hadoop, to deliver the performance that meets today's near real-time and rapid-insight requirements.
Scale-Out to accommodate your Growing Data
APS contains PDW and HDInsight that both have linear scale-out architecture. Start small with a few terabytes and dynamically add capacity for seamless, linear scale-out
Remove DW bottlenecks with MPP SQL Server
Get the dynamic performance and scale that your modern data warehouse requires while retaining your skills and investment in SQL Server.
Real-Time Performance with In-Memory
Provides up to 100x improvement in query performance and 15x compression via updateable in-memory columnstore
Concurrency that Supports High Adoption
Scales in simultaneous user accessibility. APS has high concurrency, allowing for multiple workloads.
Optimal architecture: More than just a converged system, APS has reshaped the very hardware specifications required through software innovations to deliver optimal value. Through features delivered in Windows Server 2012, customers get exceptional value:
APS Provides the Industry’s Lowest DW Price/TB
Lower cost while maintaining performance using WS2012 Storage Spaces that replace SAN with economical Windows Storage Spaces
Save up to 70% of APS storage with up to 15x compression via updateable in-memory columnstore
Value through Single Appliance Solution
Reduce hardware footprint by having PDW and HDInsight within a single appliance
Remove the need for costly integration efforts
Value through Flexible Hardware Options
Avoid hardware lock-in through flexible hardware options from HP, Dell, and Quanta
The Analytics Platform System is a pre-built appliance that ships to your door. As an appliance, all of the hardware has been pre-built: servers, storage arrays, switches, power, racks, and more. Also, all the software has been installed, configured, and tuned.
Customers are delivered a fully packaged appliance solution that just works. All they have to do is plug the appliance in and start integrating their specific data into the solution.
KEY POINT
Use an interesting story to show how the new modern data warehouse can handle real-time performance with in-memory technologies.
TALK TRACK
We have a flexible choice of hardware vendors – there’s no lock-in to hardware that may not fit your exact needs and may also require unnecessarily expensive hardware due to lack of choice.
Operating a Big Data analytics platform can be as simple as this. Avoid proprietary hardware lock-ins like others try to sell you, and rely on basic industry standard components instead. The Microsoft Analytics Platform System is available with the flexibility to choose their preferred hardware from Dell, HP or Quanta, and each hardware choice has been designed, engineered, and tuned to perform optimally.
8 tables (8 filegroups, since there is 1 filegroup per table). Each filegroup is made up of 2 physical files. Each scale unit has two compute nodes, so 16 filegroups and 32 files. Each unit has 32 CPU cores, so there is 1 core for each file.
Want high cardinality for distribution key
PDW distributes a single large logical table across 8 tables on each server.
The distribution is performed by selecting a column in each table and applying a hash function to it.
Partition 2 years by day = 2,568,493
40 servers * 8 tables =320 tables
This horizontal partitioning breaks the table up into 8 distributions per compute node. Each of these distributions (essentially a table in and of itself) has dedicated CPU and disk, which is the essence of Massively Parallel Processing in APS. There are 8 internal disks per compute node.
1TB drives: 15TB uncompressed per scale unit (2 nodes), 60TB uncompressed per rack (4 units, 8 nodes), 420TB uncompressed for 7 racks (28 units, 56 nodes). 3TB drives: 45TB uncompressed per unit (2 nodes), 180TB uncompressed per rack (4 units, 8 nodes), 1260TB uncompressed for 7 racks (28 units, 56 nodes). [See slide 125.] Tempdb, the log, and the overhead of formatting the drives, storage spaces, etc. have already been subtracted (about 47%) from the 70 1TB drives (4 hot spares, 2 for fabric storage, 32 for RAID 1, so 32 drives with unique data for 32TB per scale unit); that gives you 15TB of 'usable' space on a 1/4 rack. Apply a 5:1 compression ratio and you get 75TB. Hardware: HP ProLiant DL360p Gen8 server, 256GB RAM, 1U. Each server has 2 processors (E5-2690 "Sandy Bridge", 2.90 GHz, 20M cache) with 8 cores each, so 16 cores per server; sixteen (16) HP 16GB (2R x 4) PC3-12800R (DDR3-1333) memory modules; two (2) internal HP 600GB 6G SAS 10K 2.5in drives; paired with 1 HP D6000 high-density storage enclosure (70 HDDs (7.2K) of either 1, 2, or 3 TB capacity) connected to each server through an H221 SAS HBA (5U, 6Gb/s). I usually use the word "conservative" when I'm talking about a 5:1 ratio. I also generally mention that most of the others in the industry use the same number.
Key goal of slide: Communicate what Big Data is.
Slide talk track: ERP, SCM, CRM, and transactional web applications are classic examples of systems processing transactions. Highly structured data in these systems is typically stored in SQL databases. Web 2.0 is about how people and things interact with each other or with your business. Web logs, user clickstreams, social interactions and feeds, and user-generated content are classic places to find interaction data. Big Data is the explosion of data volume and types, inside and outside the business, too large for traditional systems to manage. There are multiple types of data, including personal, organizational, public, and private. More important, Big Data is changing how the business uses data, from historical analysis to predictive analytics. Enterprises are using data in more progressive and higher-value applications. These uses and applications are changing how data must be stored, managed, analyzed, and accessed in order to provide not just the historical and insight analysis of the current data warehouse, but the predictive analytics and forecasting needed to stay competitive in the current marketplace.
Key goal of slide: Communicate what Hadoop is.
Slide talk track: Everyone has heard of Hadoop. But what is it? And do I need it? Apache Hadoop is an open-source framework that supports data-intensive distributed applications on large clusters of commodity hardware. Hadoop is composed of a few parts:
• HDFS – the Hadoop Distributed File System, Hadoop’s file system, which stores large files (from gigabytes to terabytes) across multiple machines.
• MapReduce – a programming model that performs filtering, sorting, and other data-retrieval operations as a parallel, distributed algorithm.
Other parts of Hadoop – including HBase, R, Pig, Hive, Flume, Mahout, Avro, and ZooKeeper – are pieces of the Hadoop ecosystem that supplement it with additional functions.
Key goal of slide: Communicate conceptually how companies are managing Big Data in current data warehouse environments. This shows both setting up a side-by-side Hadoop cluster and ETL-ing data into the existing data warehouse.
Slide talk track: Many companies have responded to the explosion of Big Data by setting up side-by-side Hadoop ecosystems. However, these companies are learning the limitations of this approach, including the steep learning curve of MapReduce and other Hadoop ecosystem tools, and the cost of installing, maintaining, and tooling side-by-side ecosystems to support two separate query models. Many Hadoop solutions do not integrate with enterprise or other data warehouse systems, creating complexity and cost and slowing time to insight. Some Hadoop solutions feature vendor lock-in, creating long-term obligations. Other companies set up costly extract, transform, and load (ETL) operations to move non-relational data directly into the data warehouse. This requires IT to modify or create new data schemas for all new data, which is also time consuming and costly. As a result, performance is degraded, and it is often more expensive to integrate new data, build new applications, or access key BI insights.
Key goal of slide: Communicate what HDInsight is.
Slide talk track: HDInsight is an enterprise-ready, Hadoop-based distribution from Microsoft that brings a 100% Apache Hadoop solution to the data warehouse. APS gives customers Hadoop with the simplicity of a single appliance, and Microsoft integrates Hadoop data processing directly into the architecture of the appliance for optimum performance. Each HDInsight node has ‘shared nothing’ access to CPU, memory, and storage. HDInsight for APS is the most enterprise-ready Hadoop distribution on the market, offering enterprise-class security, scalability, and manageability. Thanks to a dedicated secure node, HDInsight helps you secure your Hadoop cluster. HDInsight also simplifies management through System Center, and because the appliance deploys with Active Directory, organizations can provide multiple users simultaneous access to HDInsight within the appliance.
This diagram illustrates the basic layout of the direct-to-fabric Hadoop region alongside a data warehouse region, designed for the APS appliance and Windows Azure. Each region provides a boundary for workload, security, metering, and servicing. HDInsight is a Hadoop region that sits over the fabric of the appliance, alongside the PDW region, for processing. Both regions take advantage of PolyBase as a shared query and processing model, which results in exceptional performance improvements across every node. Based on the Hortonworks 1.0 HDFS, the new HDI (HDInsight) region within APS is a dedicated Hadoop region that sits directly on top of the fabric layer of the appliance to share metered resources with the APS engine and process Hadoop cluster data. In some respects this transforms APS into a concurrent relational and Hadoop engine, resulting in much better performance. An appliance can be configured to support relational queries only (excluding the HDI region), to provide Hadoop-only nodes, or to support both relational and Hadoop workloads from a single appliance. In addition, HDInsight enables the processing of Hadoop data in place, without the need for expensive ETL (extract, transform, and load). By taking advantage of Azure Storage Vault blobs, HDInsight can even extend the storage of the traditional data warehouse into the cloud. Technically, adding one or more HDI scale units to an all-PDW APS rack is an "add region" operation, which is supported; adding HDI scale units to a rack that already contains HDI is an "add capacity/unit" operation, which is not supported in AU1.
Key goal of slide: PolyBase is available only within the Microsoft Analytics Platform System.
Slide talk track: PolyBase simplifies this by allowing Hadoop data to be queried with the standard Transact-SQL (T-SQL) query language, without the need to learn MapReduce and without the need to move the data into the data warehouse. PolyBase unifies relational and non-relational data at the query level.
Integrated query: PolyBase accepts a standard T-SQL query that joins tables containing a relational source with tables in a Hadoop cluster referencing a non-relational source, then seamlessly returns the results to the user. PolyBase can query Hadoop data in other Hadoop distributions such as Hortonworks or Cloudera.
No difficult learning curve: Standard T-SQL can be used to query Hadoop data; users are not required to learn MapReduce to execute the query.
Cloud-hybrid scenario options: PolyBase can also query across Windows Azure HDInsight, providing a hybrid cloud solution for the data warehouse.
The ability to query all of your company’s data with good performance, independent of where it resides and what format it is stored in, is crucial in today’s data-centric world of massive, increasing data volume. Today, with AU1, one can query various Hadoop distributions plus data stored in Azure. For example, with a single T-SQL statement a user can query data stored in multiple HDP 2.0 clusters, combine it with data in PDW, and combine it with data stored in Azure. No one else in the industry (as far as I am aware) can do this in such a simple fashion. By bringing all Microsoft assets together, on-premises and especially through our Azure play, including various services that will come online in the future, we can clearly differentiate through our unique, complete, end-to-end data management story.
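To make the integrated-query point concrete, the kind of statement PolyBase enables can be sketched as below. All table names, columns, and the HDFS path are hypothetical, and the exact external-table DDL varies across PolyBase releases; treat this as an illustrative sketch, not the appliance’s literal syntax.

```sql
-- Hypothetical sketch: names, columns, and the HDFS location are
-- illustrative; exact PolyBase DDL varies by APS release.
-- Expose a set of delimited files in Hadoop as an external table.
CREATE EXTERNAL TABLE dbo.WebClickstream (
    UserId    INT,
    Url       VARCHAR(500),
    EventTime DATETIME
)
WITH (
    LOCATION = 'hdfs://hadoop-head-node:8020/logs/clickstream/',
    FORMAT_OPTIONS (FIELD_TERMINATOR = ',')
);

-- One standard T-SQL statement then joins relational PDW data
-- with non-relational Hadoop data; no MapReduce required.
SELECT c.CustomerName, COUNT(*) AS Clicks
FROM dbo.Customers AS c
JOIN dbo.WebClickstream AS w
    ON w.UserId = c.CustomerId
GROUP BY c.CustomerName;
```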
No doubt there are several pieces missing in our ‘Poly’ vision – including support for other data stores, push-down computation for our cloud story, more user-definable language options, better automation and policies, and many more ideas we would like to pursue in the weeks and months ahead.
HDInsight benefits: inexpensive and quick to procure.
Key goal of slide: Highlight the four main use cases for PolyBase.
Slide talk track: There are four key scenarios for using PolyBase with the lake of data normally locked up in Hadoop.
• Query: PolyBase leverages the APS MPP architecture, along with optimizations like push-down computation, to query data using Transact-SQL faster than other Hadoop technologies such as Hive. More importantly, you can use Transact-SQL join syntax between Hadoop data and PDW data without having to import the data into PDW first.
• Archive: PolyBase is a great tool for archiving older or unused data from APS to less expensive storage on a Hadoop cluster. When you do need to access the data for historical purposes, you can easily join it back up with your PDW data using Transact-SQL.
• Export: There are times when you need to share your PDW data with Hadoop users, and PolyBase makes it easy to copy data to a Hadoop cluster.
• Import: Using a simple SELECT INTO statement, PolyBase makes it easy to import valuable Hadoop data into PDW without external ETL processes.
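The archive and import scenarios can be sketched roughly as follows. All object names and the HDFS path are hypothetical, and the DDL is a sketch of PDW-style PolyBase syntax rather than a verbatim reference.

```sql
-- Hypothetical sketch of the archive and import scenarios.
-- Archive: write cold PDW rows out to cheaper Hadoop storage by
-- creating an external table from a query result.
CREATE EXTERNAL TABLE dbo.SalesArchive
WITH (
    LOCATION = 'hdfs://hadoop-head-node:8020/archive/sales/',
    FORMAT_OPTIONS (FIELD_TERMINATOR = '|')
)
AS SELECT * FROM dbo.Sales WHERE SaleDate < '2010-01-01';

-- Import: pull Hadoop-resident data back into a regular PDW table
-- with a plain SELECT INTO; the external table is read through
-- PolyBase like any other table.
SELECT *
INTO dbo.SalesRestored
FROM dbo.SalesArchive
WHERE SaleDate >= '2009-01-01';
```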
Big Data adds value to the business when it is accessible to BI users through tools that are easy to use and consume for IT and business users alike. Some Hadoop solutions provide their own BI tools, or require customers to find third-party BI solutions; these often result in low adoption rates due to learning curves. Surveys from Gartner, The BI Survey, and Intelligent Enterprise have found abysmal BI adoption of current solutions (~8%), with complaints about the complexity of the tools and the cost of the solution. The BI solution must be provided to users in tools they already know and can consume. APS is the only data warehouse and Hadoop solution with native end-to-end Microsoft BI integration through PolyBase, allowing users to create new insights themselves using tools they already know. Every Microsoft BI client – SSAS, SSRS, PowerPivot, and Power View – has native integration with APS and ubiquitous connectivity across the entire SQL Server ecosystem. With native BI integration, Microsoft is unique in offering an end-to-end Big Data solution with no barriers in the journey from acquiring raw data of all types to displaying high-value insights to all users. By providing the customer with an HDInsight region in APS, with PolyBase for querying and joining any type of data in T-SQL, and by democratizing access to data insight through familiar BI tools, Microsoft is prepared to deliver Big Data insights to any user.
Key goal of slide: Convey the major pillars of the Analytics Platform System with key points.
Next-generation performance at scale: APS was built to scale into the multi-petabyte range, handling both RDBMS data and data stored in Hadoop, to deliver the performance that today’s near-real-time and rapid-insight requirements demand.
Scale out to accommodate your growing data: APS contains PDW and HDInsight, both of which have a linear scale-out architecture. Start small with a few terabytes and dynamically add capacity for seamless, linear scale-out.
Remove DW bottlenecks with MPP SQL Server: Get the dynamic performance and scale your modern data warehouse requires while retaining your skills and investment in SQL Server.
Real-time performance with in-memory: Provides up to 100x improvement in query performance and up to 15x compression via the updateable in-memory columnstore.
Concurrency that supports high adoption: Scales in simultaneous user accessibility. APS has high concurrency, allowing for multiple workloads.
Today, if you are not using an MPP scale-out appliance, most likely your data warehouse is built on the traditional scale-up SMP architecture and organized as row stores. A scale-up solution runs queries sequentially on a shared-everything architecture. This essentially means that everything is processed on a single box that shares memory, disk, I/O, and more. To get more scale from a scale-up solution, you need to acquire a more powerful hardware box every time; you cannot add more hardware to the existing rack, and a scale-up solution has diminishing returns after a certain scale. A rowstore stores data in traditional tables as rows, with the values comprising one row stored contiguously on a page. Rowstores are often not optimal for the queries issued to a data warehouse, because the query returns the entire row of data – including fields that might not be needed as part of the query. The combination of scale-up SMP and rowstores is the common limitation of existing warehouses that affects performance.
Key goal of slide: Communicate that the Microsoft modern data warehouse can scale out to petabytes of relational data.
Slide talk track: SQL Server 2012 APS is a scale-out, massively parallel processing (MPP) architecture – the same class of distributed computing that powers supercomputers to achieve raw computing horsepower. As more scale is needed, more resources can be added to scale out to the largest data warehousing projects. APS uses a shared-nothing architecture with multiple physical nodes, each running its own instance of SQL Server with dedicated CPU, memory, and storage. As queries enter the system, they are broken up to run simultaneously across the physical nodes. The benefit is the highest performance at scale through parallel execution; you need only add new resources to continue scaling out. This means that even with high concurrency and complex queries at scale, APS handles these queries with ease. It also means APS can be optimized for "mixed workload" and "near-real-time" data analysis, with faster data loading at more than two terabytes per hour.
Other benefits of scale-out technology:
• Start small and scale out to petabytes of data
• Optimized for "mixed workload" and "near-real-time" data analysis
• Support for high concurrency
• Query while you load
• No hardware bottlenecks
• No "forklifting" when you want to scale your system
• Scale not only for data size but for faster queries
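The shared-nothing layout shows up directly in table design: when creating a table, you choose how its rows spread across the compute nodes. The names below are hypothetical, and the options shown are a sketch of PDW-style DDL under those assumptions.

```sql
-- Illustrative sketch of PDW-style distributed table design.
-- Large fact tables are hash-distributed so that each compute node
-- owns a slice of the rows and scans it in parallel.
CREATE TABLE dbo.FactSales (
    SaleId     BIGINT,
    CustomerId INT,
    SaleDate   DATE,
    Amount     MONEY
)
WITH (DISTRIBUTION = HASH(CustomerId));

-- Small dimension tables can instead be replicated to every node,
-- so joins against them never need to move data across the network.
CREATE TABLE dbo.DimRegion (
    RegionId   INT,
    RegionName VARCHAR(50)
)
WITH (DISTRIBUTION = REPLICATE);
```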
Key goal of slide: Use an interesting story to show how the new modern data warehouse can handle real-time performance with in-memory technologies.
TODO: for parallel query execution – explain the difference from SMP.
Slide talk track: The biggest issue with traditional data warehouses is that data is stored in rows; the values comprising one row are stored contiguously on a page. Rowstores are not optimal for many data warehouse queries, because a query returns the entire row of data – including fields that might not be needed. By changing the primary storage engine to a new, updateable version of the in-memory columnstore, data is grouped and stored one column at a time. The benefits are as follows:
• Only the columns needed must be read, so less data is read from disk to memory and later moved from memory to processor cache.
• Columns are heavily compressed, which reduces the number of bytes that must be read and moved.
• Most queries do not touch all columns of the table, so many columns are never brought into memory. This, combined with excellent compression, improves buffer pool usage and reduces total I/O.
The result is massive compression (sometimes as much as 10x) as well as massive performance gains (as much as 100x). Using columnstore also leverages your existing hardware instead of requiring you to purchase a new appliance.
New in SQL Server 2012 APS and SQL Server 2014: the updateable, clustered columnstore. Updates and direct bulk load are fully supported, which simplifies and accelerates data loading and enables real-time data warehousing and trickle loading. Using columnstore can also save roughly 70% of overall storage space if you choose to eliminate the rowstore copy of the data entirely.
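A rough sketch of how a table is declared with the updateable clustered columnstore is shown below. Table and column names are hypothetical, and the exact DDL differs slightly between SQL Server 2014 and the APS build; this is illustrative only.

```sql
-- Hypothetical sketch: a fact table stored column-by-column.
CREATE TABLE dbo.FactPageViews (
    ViewDate DATE,
    PageId   INT,
    Views    BIGINT
)
WITH (
    DISTRIBUTION = HASH(PageId),   -- spread rows across compute nodes
    CLUSTERED COLUMNSTORE INDEX    -- primary storage is columnar
);

-- This aggregate touches only two of the three columns, so only
-- those compressed column segments need to be read from disk.
SELECT ViewDate, SUM(Views) AS TotalViews
FROM dbo.FactPageViews
GROUP BY ViewDate;
```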
Key goal of slide: Explain the limitations of serial-processing SMP architecture versus high-concurrency MPP.
• High-performance ad hoc analytic queries
• Pull insights simultaneously throughout the day
• Run multiple types of queries simultaneously
• Run multiple types of workloads together with no tuning required
• High concurrency means high availability, which means higher adoption
Slide talk track: With the explosion of data and the growth of end users demanding real-time insights, data warehouses are not only growing in resources but also in the number of users frequently accessing them. A modern data warehouse needs to scale out to return query results quickly, and it also needs to run mixed workloads all at the same time. Mixed workloads refer to concurrency: multiple types of queries are submitted along with data loads and ELT processing. Under mixed-workload scenarios, which organizations are certain to face, APS runs concurrent queries with little or no tuning. Organizations no longer have to worry about the types of workloads being run at any given time, and Microsoft APS can handle many users pulling insights simultaneously throughout the day.
Key goal of slide: Convey the major pillars of the Analytics Platform System with key points.
Optimal architecture: More than just a converged system, APS has reshaped the very hardware specifications required, through software innovation, to deliver optimal value. Through features delivered in Windows Server 2012, customers get exceptional value:
APS provides the industry’s lowest DW price per terabyte: Lower cost while maintaining performance by replacing the SAN with economical Windows Server 2012 Storage Spaces; save up to 70% of APS storage with up to 15x compression via the updateable in-memory columnstore.
Value through a single-appliance solution: Reduce the hardware footprint by having PDW and HDInsight within a single appliance, and remove the need for costly integration efforts.
Value through flexible hardware options: Avoid hardware lock-in through flexible hardware options from HP, Dell, and Quanta.
Competitors: EMC Greenplum, Teradata, Oracle Exadata, HP Vertica, and IBM Netezza.
Key goal of slide: Use an interesting story to show how the new modern data warehouse can handle real-time performance with in-memory technologies.
Slide talk track: Value through software innovation and hardware commoditization. More than just a converged system, APS has reshaped the very hardware specifications required, through software innovation, to deliver optimal value. Through features delivered in Windows Server 2012, customers get exceptional value. Through Storage Spaces, APS has the performance, reliability, and scale for storage built into the software, allowing it to replace the SAN with a more economical high-density disk option. This results in large capacity at low cost with no reduction in performance. Hyper-V virtualization and hardware design minimize the hardware footprint and cost of the appliance, enabling high availability as simply as possible. Microsoft lowers cost by reducing the hardware footprint through virtualization, using Storage Spaces to replace expensive SAN storage, and compressing data up to 15x to lower storage usage. These features give APS the lowest relational data warehouse price per terabyte of any vendor by a significant margin (~2x lower than market). The overall market’s comparable price per terabyte ranges from $8–13K/TB. For example, Oracle announced an Exadata form factor of a 1/8th rack that costs $200K; however, this is only the hardware cost and does not include software prices, which can add significantly more – hundreds of thousands to a million dollars.
Even accounting for Oracle’s 10x compression, APS has a price per terabyte that is about half Oracle’s list price for their normal drive sizes (non-high-capacity).
IBM PureData pricing: < $500,000 for a quarter rack (8 TB uncompressed) at 4x compression (= $12–15K/TB). http://www.theregister.co.uk/2012/10/10/ibm_puredata_database_appliances/
Oracle Exadata pricing: hardware ($1.1M) – http://www.oracle.com/us/corporate/pricing/exadata-pricelist-070598.pdf; software ($7.2M) – http://www.oracle.com/us/corporate/pricing/technology-price-list-070617.pdf; at 100 TB uncompressed and 10x compression (= $8K/TB).
EMC Greenplum pricing: $1,000,000 for a half rack (18 TB uncompressed) at 4x compression (= $13.8K/TB). http://www.informationweek.com/software/information-management/emc-intros-backup-savvy-greenplum-applia/227701321
Pricing analysis was done on the last-known publicly accessible information available and represents the current view of Microsoft Corporation as of the date of this presentation. Because companies respond to changing market conditions, it should not be interpreted as a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided outside of the sources cited or after the date of this presentation. Source: Value Prism Consulting, "Microsoft’s SQL Server Parallel Data Warehouse Provides High Performance and Great Value"; website: http://www.valueprism.com/resources/resources/ResourceDetails.aspx?ID=100
Windows Server 2012 and Windows Azure Virtual Machines offer full virtualization services for both on-premises and on-demand installations.
General details:
• All hosts run Windows Server 2012 Standard and Windows Azure Virtual Machines
• Fabric and workload run in Hyper-V virtual machines
• The fabric virtual machine, MAD01, and CTL share one server, lowering overhead costs, especially for small topologies
• The APS agent runs on all hosts and all virtual machines and collects appliance health data on fabric and workload
• DWConfig and the Admin Console continue to exist, with minor extensions to expose host-level information
• Windows Storage Spaces and Azure Storage blobs enable the use of lower-cost DAS (JBODs)
APS workload details:
• SQL Server 2012 Enterprise Edition (APS build)
• Control node and compute nodes for the APS workload
Storage details:
• More files per filegroup
• Uses a larger number of spindles in parallel
Key goal of slide: APS was built to scale to handle the highest data requirements and the newest data types stored in Hadoop, and to deliver the performance that meets today’s near-real-time requirements.
Slide talk track: A modern data warehouse is progressive, meeting broad needs and requirements:
• Hadoop integrates and operates seamlessly with your relational data warehouse
• Data is easily queried by SQL users without additional skills or training
• Enterprise-ready, meaning it is secure and easily managed by IT
• Insights are accessible to everyone
The Microsoft Analytics Platform System (APS) is the only no-compromise modern data warehouse solution that brings both Hadoop and an RDBMS together in a single, pre-built appliance with tier-one performance, the lowest TCO in the industry, and accessibility for all users through some of the most widely used BI tools in the industry. Microsoft APS combines Microsoft’s industry-leading RDBMS platform, the Parallel Data Warehouse (PDW), with Microsoft’s Hadoop distribution, HDInsight, for non-relational data to offer an all-in-one Big Data analytics appliance. Tying together and integrating the worlds of relational and non-relational data is PolyBase, Microsoft’s integrated query tool, available only in APS.
• Data capacity: variable from the smallest configuration (15 terabytes) to the largest (6 petabytes with 5:1 compression; 1.2 petabytes uncompressed), spanning 1/4 rack up to 7 racks
• Data loading speed: ideally 175 GB/hour per node (8 nodes give roughly 1.4 TB/hour); we have seen 250 GB/hour – 10–20x faster
• Data compression: 3x–15x, with 5x as a conservative figure; compression is unique because of distribution across compute nodes
• Query performance: 10x–100x, with a reasonably linear increase as racks are added