As a follow-on to the presentation "Building an Effective Data Warehouse Architecture", this presentation explains what Big Data is and what its benefits are, including use cases. We will discuss how Hadoop, the cloud and massively parallel processing (MPP) are changing the way data warehouses are built. We will talk about hybrid architectures that combine on-premises data with data in the cloud, as well as relational data with non-relational (unstructured) data. We will look at the benefits of MPP over SMP and how to integrate data from Internet of Things (IoT) devices. You will learn what a modern data warehouse should look like and how a Data Lake and Hadoop fit in. In the end you will have guidance on the best solution for your data warehouse going forward.
1. Building a Big Data solution
“Building an Effective Data Warehouse Architecture
with Hadoop, the cloud, and MPP”
James Serra
Big Data Evangelist
Microsoft
JamesSerra3@gmail.com
2. Other Presentations
Building an Effective Data Warehouse Architecture
Reasons for building a DW and the various approaches and DW concepts (Kimball vs Inmon)
Building a Big Data Solution (Building an Effective Data Warehouse
Architecture with Hadoop, the cloud and MPP)
Explains what Big Data is, its benefits including use cases, and how Hadoop, the cloud, and MPP fit in
Finding business value in Big Data (What exactly is Big Data and why
should I care?)
Very similar to “Building a Big Data Solution” but target audience is business users/CxO instead of architects
How does Microsoft solve Big Data?
Covers the Microsoft products that can be used to create a Big Data solution
Modern Data Warehousing with the Microsoft Analytics Platform System
The next step in data warehouse performance is APS, an MPP appliance
Power BI, Azure ML, Azure HDInsight, Azure Data Factory, etc.
Deep dives into the various Microsoft Big Data related products
3. About Me
Business Intelligence Consultant, in IT for 28 years
Microsoft, Big Data Evangelist
Worked as desktop/web/database developer, DBA, BI and DW architect and developer, MDM
architect, PDW developer
Been perm, contractor, consultant, business owner
Presenter at PASS Business Analytics Conference and PASS Summit
MCSE for SQL Server 2012: Data Platform and BI
Blog at JamesSerra.com
SQL Server MVP
Author of book “Reporting with Microsoft SQL Server 2012”
4. I tried building a Big Data solution…
And ended up passed-out drunk in a Denny’s
parking lot
Let’s prevent that from happening…
5. Agenda
Review of Building an Effective Data Warehouse Architecture
Overview of Big Data and Analytics
Use cases
Data Lake
Hadoop and its role
IoT and real-time data
Modern data warehouse
Federated querying
DW and the cloud
Symmetric Multiprocessing (SMP) vs. Massively Parallel Processing (MPP)
7. What is a Data Warehouse and why use one?
A data warehouse is where you store data from multiple data sources to be used for historical and trend
analysis reporting. It acts as a central repository for many subject areas and contains the "single version of
truth". It is NOT to be used for OLTP applications.
Reasons for a data warehouse:
Reduce stress on production system
Optimized for read access, sequential disk scans
Integrate many sources of data
Keep historical records (no need to save hardcopy reports)
Restructure/rename tables and fields, model data
Protect against source system upgrades
Use Master Data Management, including hierarchies
No IT involvement needed for users to create reports
Improve data quality and plug holes in source systems
One version of the truth
Easy to create BI solutions on top of it (e.g., SSAS cubes)
Previous presentation “Building an Effective Data Warehouse Architecture”:
http://pragmaticworks.com/Training/FreeTraining/ViewWebinar/WebinarID/532
http://www.slideshare.net/jamserra/data-warehouse-architecture-16065902
8. Why use a Data Warehouse?
Legacy applications + databases = chaos
Production Control, MRP, Inventory Control, Parts Management, Logistics, Shipping, Raw Goods, Order Control, Purchasing, Marketing, Finance, Sales, Accounting, Management Reporting, Engineering, Actuarial, Human Resources
Continuity, Consolidation, Control, Compliance, Collaboration
Enterprise data warehouse = order
Single version of the truth
Every question = decision
Two purposes of a data warehouse: 1) save time building reports; 2) slice and dice in ways you could not do before
9. Data Warehouse Hybrid Model
Advice: Use SQL Server Views to interface between each level in the model
In the DW Bus Architecture, each data mart could be a schema (broken out by business process subject areas), all in one database.
Another option is to have each data mart in its own database with all databases on one server or spread among multiple servers.
Also, the staging areas, CIF, and DW Bus can all be on the same powerful server (MPP)
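A minimal sketch of the view advice above, assuming Python's built-in sqlite3 module as a stand-in for SQL Server. The table name `stg_orders`, the view name `v_orders`, and the column names are hypothetical. The idea is that each level reads from the level below only through a view, so the lower level can be renamed or restructured without breaking its consumers.

```python
import sqlite3

# In-memory SQLite stands in for SQL Server; all object names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE stg_orders (ord_id INTEGER, cust_nm TEXT, amt_usd REAL);
    INSERT INTO stg_orders VALUES (1, 'Contoso', 120.0), (2, 'Fabrikam', 80.5);

    -- The view is the contract between the staging level and the next level:
    -- downstream code sees stable, friendly column names.
    CREATE VIEW v_orders AS
    SELECT ord_id AS order_id, cust_nm AS customer_name, amt_usd AS amount
    FROM stg_orders;
""")

def load_next_level(connection):
    """Read only through the view, never from the staging table directly."""
    return connection.execute(
        "SELECT order_id, customer_name, amount FROM v_orders ORDER BY order_id"
    ).fetchall()

rows = load_next_level(conn)
```

If the staging table is later restructured, only the view definition needs to change; `load_next_level` stays the same.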
13. What is Big Data, really?
Data in all forms & sizes is being generated faster than ever before
Capture & combine it for new insights & better, faster decisions
14. Harness the growing and changing nature of data
Collect any data: structured, unstructured, streaming
Challenge is combining transactional data stored in relational databases with less structured data
Big Data = All Data
Get the right information to the right people at the right time in the right format
15. An illustration of the velocity of data created
Kalakota, R. (2012, October 22). Sizing “Mobile + Social” Big Data Stats. Retrieved from http://practicalanalytics.wordpress.com/
17. Complex implementations
Enterprise data warehouse
Spreadmarts
Siloed data
Hadoop
Dashboards
Ad hoc analysis
Machine learning
OLAP
Any data
In-memory
Internet of Things
Innovation
Transactional systems
ETL
Operational reporting
Value
Technology innovation accelerates value
19. Put data to work for everyone in your organization
Inspire innovation
Accelerate decision-making
Learn from & share insights
20. Embrace Big Data across your business
[Dashboard figure; panels: Units Sold, Discounts, and Profit before Tax; Revenue and Target by Region; Departments Headcount; XT2000 Status List]
Sales: Improve revenue performance
HR: Maximize employee engagement
Marketing: Build deeper customer relationships
Finance: Impact your company’s bottom line
21. The Data Divide
80% of data stored
70% of data generated by customers
3% prepared for analysis
0.5% being analyzed
<0.5% being operationalized
22. Major Fail
Gartner: “Through 2017, 60% of big-data projects will fail to go beyond piloting and experimentation”
Paradigm4: 76% of those who have used Hadoop or Apache Spark complained of significant limitations
23. Analytics Solution
Capture and integrate data from multiple internal and external sources
Derive insight from data with rich, interactive dashboards and reports using the tools you know
Put insight into action to increase efficiency and constituent satisfaction
28. Data Analytics is needed everywhere
Recommendation engines
Smart meter monitoring
Equipment monitoring
Advertising analysis
Life sciences research
Fraud detection
Healthcare outcomes
Weather forecasting for business planning
Oil & Gas exploration
Social network analysis
Churn analysis
Traffic flow optimization
IT infrastructure & Web App optimization
Legal discovery and document archiving
Intelligence Gathering
Location-based tracking & services
Pricing Analysis
Personalized Insurance
29. Personalized Insurance
Personalized policies can reduce costs & better meet customer needs
Insurance companies can help (and some have already started helping) their customers with truly personalized insurance plans tailored to their needs and risks
Insurance companies can collect real-time data from in-car sensors and combine it with geolocation and in-house systems. With information such as distance and speed, they can provide personalized insurance offers based on driving amount, risk, and other factors, for a truly personalized plan that may often save drivers money
$1,600/yr.: US national avg. car insurance premium
30. Recommendation Engines
Significantly improve up-sell and cross-sell opportunities
Retailers can use customer purchase & rating information to serve recommendations to current customers, based on similarities across many dimensions
The vast amount of current and ever-growing customer purchase, rating and click data can all be collected and managed with a Hadoop-based solution to pinpoint preferences based on purchase history and demographics, and to serve useful and compelling cross-sell and up-sell recommendations.
158: Items sold/second by Amazon.com on 11/29/2010 (Cyber Monday)
31. Pricing Analysis
Significantly improve sales and customer satisfaction
Retailers can use customers' past purchase, preference, and demographic information to serve real-time custom pricing and instant discounts when they are near the store.
Retailers – whether large, small, online or in-store – can improve margins with more detailed pricing analysis. When a customer is in range of a transaction (either in the store, online or perhaps passing by), serve personalized offers, real-time price quotes, or other frequent-buyer perks to help bring more customers to the store and improve repeat business.
Up to 30%: Additional price Mac users accepted for travel from Orbitz
34. What is a data lake?
A storage repository, usually Hadoop, that holds a vast amount of raw data in its native
format until it is needed.
• A place to store unlimited amounts of data in any format inexpensively
• Allows collection of data that you may or may not use later: “just in case”
• A way to describe any large data pool in which the schema and data requirements are not
defined until the data is queried: “just in time” or “schema on read”
• Complements EDW and can be seen as a data source for the EDW – capturing all data but
only passing relevant data to the EDW
• Frees up expensive EDW resources (storage and processing), especially for data refinement
• Allows for data exploration to be performed without waiting for the EDW team to model
and load the data
• Some processing is better done on Hadoop than in ETL tools like SSIS
• Also called bit bucket, staging area, landing zone or enterprise data hub (Cloudera)
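To make "schema on read" concrete, here is a small sketch in plain Python rather than on Hadoop. The sample records, the field names, and the `read_with_schema` helper are all made up for illustration. Raw records are stored exactly as they arrive, and a schema is applied only when a particular question is asked.

```python
import json

# Raw events land in the "lake" exactly as produced; no schema is enforced on write.
raw_lines = [
    '{"device": "meter-1", "kwh": "3.2", "ts": "2015-06-01T00:00:00"}',
    '{"device": "meter-2", "kwh": 4.1}',
    '{"user": "alice", "clicked": "home"}',
]

def read_with_schema(lines, schema):
    """Apply a schema at query time: keep records that have every field, cast each value."""
    for line in lines:
        record = json.loads(line)
        if all(field in record for field in schema):
            yield {field: cast(record[field]) for field, cast in schema.items()}

# The schema belongs to the query, not to the storage.
meter_schema = {"device": str, "kwh": float}
meter_readings = list(read_with_schema(raw_lines, meter_schema))
```

The click event is still in the lake, untouched; a different schema passed to the same function would pick it up later ("just in case" data becoming useful "just in time").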
35. Current state of a data warehouse
Traditional Approaches
DATA SOURCES: CRM, ERP, OLTP, LOB
ETL
DATA WAREHOUSE: Star schemas, views, other read-optimized structures
BI AND ANALYTICS: Emailed, centrally stored Excel reports and dashboards
MONITORING AND TELEMETRY
Well manicured, often relational sources
Known and expected data volume and formats
Little to no change
Complex, rigid transformations
Required extensive monitoring
Transformed historical into read structures
Flat, canned or multi-dimensional access to historical data
Many reports, multiple versions of the truth
24 to 48h delay
36. Current state of a data warehouse
Traditional Approaches
DATA SOURCES: CRM, ERP, OLTP, LOB
ETL
DATA WAREHOUSE: Star schemas, views, other read-optimized structures
BI AND ANALYTICS: Emailed, centrally stored Excel reports and dashboards
MONITORING AND TELEMETRY
INCREASING DATA VOLUME, NON-RELATIONAL DATA
INCREASE IN TIME
STALE REPORTING
Increase in variety of data sources
Increase in data volume
Increase in types of data
Pressure on the ingestion engine
Complex, rigid transformations can no longer keep pace
Monitoring is abandoned
Delay in data, inability to transform volumes, or react to new sources
Repair, adjust and redesign ETL
Reports become invalid or unusable
Delay in preserved reports increases
Users begin to “innovate” to relieve starvation
37. Data Lake Transformation (ELT not ETL)
New Approaches
DATA SOURCES: CRM, ERP, OLTP, LOB
FUTURE DATA SOURCES
NON-RELATIONAL DATA
EXTRACT AND LOAD
DATA LAKE
DATA REFINERY PROCESS (TRANSFORM ON READ): Transform relevant data into data sets
OTHER REFINERY PROCESSES
DATA WAREHOUSE: Star schemas, views, other read-optimized structures
BI AND ANALYTICS: Discover and consume predictive analytics, data sets and other reports
All data sources are considered
Leverages the power of on-prem technologies and the cloud for storage and capture
Native formats, streaming data, big data
Extract and load, no/minimal transform
Storage of data in near-native format
Orchestration becomes possible
Streaming data accommodation becomes possible
Refineries transform data on read
Produce curated data sets to integrate with traditional warehouses
Users discover published data sets/services using familiar tools
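The ELT flow on this slide can be sketched in a few lines, assuming Python's built-in sqlite3 as a stand-in for both the lake and the warehouse. The table names `lake_raw` and `dw_sales`, the sample records, and both functions are hypothetical. Data is extracted and loaded untouched; the transform happens later, in a refinery step that reads the raw data and publishes a curated data set.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE lake_raw (source TEXT, payload TEXT)")  # stand-in for the data lake
conn.execute("CREATE TABLE dw_sales (region TEXT, total REAL)")    # stand-in for the warehouse

def extract_and_load(source, records):
    """E and L: store each record untouched, in near-native (JSON text) form."""
    conn.executemany(
        "INSERT INTO lake_raw VALUES (?, ?)",
        [(source, json.dumps(record)) for record in records],
    )

def refine_sales():
    """T: a refinery step that reads raw data and publishes a curated data set."""
    totals = {}
    rows = conn.execute("SELECT payload FROM lake_raw WHERE source = 'pos'").fetchall()
    for (payload,) in rows:
        sale = json.loads(payload)
        totals[sale["region"]] = totals.get(sale["region"], 0.0) + float(sale["amount"])
    conn.executemany("INSERT INTO dw_sales VALUES (?, ?)", sorted(totals.items()))

# Load everything, including data no refinery uses yet.
extract_and_load("pos", [{"region": "North", "amount": "10.5"},
                         {"region": "South", "amount": 4},
                         {"region": "North", "amount": 2}])
extract_and_load("web", [{"page": "/home", "ms": 120}])

refine_sales()
curated = conn.execute("SELECT region, total FROM dw_sales ORDER BY region").fetchall()
```

Note that the load step never inspects the records, so a new source (here `web`) can start landing data before anyone has decided how to model it.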
39. What is Hadoop?
Distributed, scalable system on commodity HW
Composed of a few parts:
HDFS – Distributed file system
MapReduce – Programming model
Other tools: Hive, Pig, Sqoop, HCatalog, HBase,
Flume, Mahout, YARN, Tez, Spark, Stinger, Oozie,
ZooKeeper, Storm
Main players are Hortonworks, Cloudera, MapR
WARNING: Hadoop, while ideal for processing huge
volumes of data, is inadequate for analyzing that
data in real time (companies do batch analytics
instead)
[Diagram: Hadoop stack grouped into Core Services, Operational Services, and Data Services. Components shown: HDFS, YARN, MapReduce, Hive & HCatalog, Pig, HBase, Falcon, Oozie, Ambari; Load & Extract: Sqoop, Flume, NFS, WebHDFS]
[Diagram: Hadoop Cluster of compute & storage nodes]
Hadoop clusters provide scale-out storage and distributed data processing
on commodity hardware
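For readers new to the MapReduce programming model mentioned on this slide, here is a single-process sketch in plain Python. It is not Hadoop code and does not use any Hadoop API; the function names are illustrative. It shows only the three logical phases (map, shuffle, reduce) using the usual word-count example.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)

def word_count(lines):
    mapped = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())

counts = word_count(["big data is big", "data lake"])
```

On a real cluster the map and reduce calls run on many nodes against blocks stored in HDFS; the logic per call is the same as above.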
40. Hortonworks Data Platform 2.3
Simply put, Hortonworks ties all the open source products together (22)
41. The real cost of Hadoop
http://www.wintercorp.com/tcod-report/
42. Use cases using Hadoop and a DW in combination
Bringing islands of Hadoop data together
Archiving data warehouse data to Hadoop (move)
(Hadoop as cold storage)
Exporting relational data to Hadoop (copy)
(Hadoop as backup/DR, analysis, cloud use)
Importing Hadoop data into data warehouse (copy)
(Hadoop as staging area, sandbox, Data Lake)
44. What is the Internet of Things?
Things, Connectivity, Data, Analytics
IoT = sensor-acquired data
45. What is the Internet of Things (IoT)?
Internet-connected devices that can perceive the environment in some way, share their data, and communicate with
you. IoT is just a catch-all term for ways of using machine-generated data to create something useful.
- Has at least one processor and sensor to collect information
- Examples: heart monitoring implants, biochip transponders on farm animals, automobiles with built-in sensors, field
operation devices that assist firefighters in search and rescue
- Excludes computers, tablets, and smart phones
- But it’s in the sphere of business intelligence that IoT will really make a difference.
Cool possibilities
- A milk carton that is almost empty will ping you when you are near a store
- An alarm clock that signals your coffee maker to start brewing when you wake up
- An embedded chip that monitors your vital signs and notifies a medical provider if they exceed a limit
Gartner: 10 billion devices connected to the internet today, 26B by 2020
At some point in the future, nearly every manmade object will contain a device that transmits data!
47. Modern Data Warehouse
Think about future needs:
• Increasing data volumes
• Real-time performance
• New data sources and types
• Cloud-born data
• Multi-platform solution
• Hybrid architecture
52. Federated Querying
Other names: Data virtualization, logical data warehouse, data
federation, virtual database, and decentralized data warehouse.
A model that allows a single query to retrieve and combine data as it sits
from multiple data sources, so as to not need to use ETL or learn more
than one retrieval technology
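As a rough illustration of the federated idea, the sketch below answers one question from two sources without copying either into a warehouse first. Everything here is an assumption made for the example: in-memory SQLite stands in for a relational source, a Python list of dictionaries stands in for a document store, and `spend_by_customer` plays the role of the federation layer. Real products handle this inside the query engine.

```python
import sqlite3

# Source 1: a relational store (in-memory SQLite as a stand-in).
orders_db = sqlite3.connect(":memory:")
orders_db.execute("CREATE TABLE orders (customer_id INTEGER, amount REAL)")
orders_db.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 50.0), (2, 20.0), (1, 25.0)])

# Source 2: a non-relational store (a list of documents as a stand-in).
customer_docs = [
    {"_id": 1, "name": "Contoso", "tags": ["retail"]},
    {"_id": 2, "name": "Fabrikam"},
]

def spend_by_customer():
    """One logical query: each source is read where it sits, then the results are combined."""
    names = {doc["_id"]: doc["name"] for doc in customer_docs}
    rows = orders_db.execute(
        "SELECT customer_id, SUM(amount) FROM orders "
        "GROUP BY customer_id ORDER BY customer_id"
    )
    return [(names[customer_id], total) for customer_id, total in rows]

result = spend_by_customer()
```

The caller sees a single result set and does not need to know that two retrieval technologies were involved.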
53. Federated Querying
Select… → Query Model → Result set
Relational Data: DB2, Oracle, SQL Server
Non-Relational Data: MongoDB, Cloudera CDH (Linux), Hortonworks HDP, Windows Azure HDInsight
55. Can I use the cloud with my DW?
• Public and private cloud
• Cloud-born data vs on-prem born data
• Transfer cost from/to cloud and on-prem
• Sensitive data on-prem, non-sensitive in cloud
• Look at hybrid solutions
58. SMP vs MPP
MPP - Massively Parallel Processing
• Uses many separate CPUs running in parallel to execute a single program
• Shared Nothing: Each CPU has its own memory and disk (scale-out)
• Segments communicate using high-speed network between nodes
SMP - Symmetric Multiprocessing
• Multiple CPUs used to complete individual processes simultaneously
• All CPUs share the same memory, disks, and network controllers (scale-up)
• All SQL Server implementations up until now have been SMP
• Mostly, the solution is housed on a shared SAN
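The shared-nothing idea behind MPP can be sketched as follows. This is a toy simulation, not how any particular appliance is implemented: threads stand in for compute nodes, `NODE_COUNT` and all function names are hypothetical, and a CRC32 hash is an arbitrary choice of distribution function. Rows are distributed by a hash of a key, each node aggregates only its own slice, and the partial results are then combined.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

NODE_COUNT = 4  # hypothetical number of compute nodes

def node_for(key):
    """Distribute rows by a stable hash of the distribution key."""
    return zlib.crc32(key.encode()) % NODE_COUNT

def distribute(rows):
    """Shared nothing: each node gets its own private slice of the data."""
    nodes = [[] for _ in range(NODE_COUNT)]
    for region, amount in rows:
        nodes[node_for(region)].append((region, amount))
    return nodes

def local_aggregate(node_rows):
    """Runs on one node, touching only that node's rows."""
    totals = {}
    for region, amount in node_rows:
        totals[region] = totals.get(region, 0) + amount
    return totals

def mpp_sum_by_region(rows):
    with ThreadPoolExecutor(max_workers=NODE_COUNT) as pool:
        partials = list(pool.map(local_aggregate, distribute(rows)))
    # Rows are distributed on the grouping key, so partial results never overlap.
    combined = {}
    for partial in partials:
        combined.update(partial)
    return combined

sales = [("North", 10), ("South", 5), ("North", 7), ("East", 3), ("South", 1)]
totals = mpp_sum_by_region(sales)
```

Adding capacity in this model means raising the node count (scale-out), whereas in SMP all of the work would run against one shared pool of memory and disk (scale-up).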
59. DW Scalability Spider Chart
[Spider chart; axes: “Data Volume”, “Query Data Volume”, “Query Concurrency”, “Query Complexity”, “Query Freedom”, “Schema Sophistication”, “Data Freshness”, “Mixed Workload”]
MPP – Multidimensional scalability
SMP – Tunable in one dimension at the cost of other dimensions
The spiderweb depicts important attributes to consider when evaluating Data Warehousing options. Big Data support is the newest dimension.
60. When do you need an MPP solution?
• We need at least 3x query performance improvement
• We are near disk capacity and see a lot of growth in the upcoming years
• We need to support queries during our maintenance window
• We need to load data outside of our maintenance window
• We will spend a lot of money on FusionIO cards, SSDs, more SAN space, more
memory, faster CPUs
61. Summary
• We live in an increasingly data-intensive world
• Much of the data stored online and analyzed today is more varied than the data stored in recent years
• More of our data arrives in near-real time
This presents a large business opportunity. Are you ready for it?
62. Resources
The Modern Data Warehouse: http://bit.ly/1xuX4Py
Fast Track Data Warehouse Reference Architecture for SQL Server 2014: http://bit.ly/1xuX9m6
Should you move your data to the cloud? http://bit.ly/1xuXbKU
Presentation slides for Modern Data Warehousing: http://bit.ly/1xuXcP5
Presentation slides for Building an Effective Data Warehouse Architecture: http://bit.ly/1xuXeX4
Hadoop and Data Warehouses: http://bit.ly/1xuXfu9
What is the Microsoft Analytics Platform System (APS)? http://bit.ly/1xuXipO
Parallel Data Warehouse (PDW) benefits made simple: http://bit.ly/1xuXlSy
What is Advanced Analytics? http://bit.ly/1LDklkB
63. Q & A ?
James Serra, Big Data Evangelist
Email me at: JamesSerra3@gmail.com
Follow me at: @JamesSerra
Link to me at: www.linkedin.com/in/JamesSerra
Visit my blog at: JamesSerra.com (where this slide deck will be posted)