In recent years, Apache™ Hadoop® has emerged from humble beginnings to disrupt the traditional disciplines of information management. As with all technology innovation, hype is rampant, and data professionals are easily overwhelmed by diverse opinions and confusing messages.
Even seasoned practitioners sometimes miss the point, claiming for example that Hadoop replaces relational databases and is becoming the new data warehouse. It is easy to see where these claims originate, since both Hadoop and Teradata® systems run in parallel, scale to enormous data volumes, and have shared-nothing architectures. At a conceptual level it is easy to think they are interchangeable, but the differences overwhelm the similarities. This session will shed light on those differences and help architects, engineering executives, and data scientists identify when to deploy Hadoop and when it is best to use an MPP relational database in a data warehouse, discovery platform, or other workload-specific application.
Two of the most trusted experts in their fields, Steve Wooledge, VP of Product Marketing at Teradata, and Jim Walker of Hortonworks, will examine how big data technologies are being used by practitioners today.
Trending use cases have pointed out the complementary nature of Hadoop and existing data management systems—emphasizing the importance of leveraging SQL, engineering, and operational skills, as well as incorporating novel uses of MapReduce to improve distributed analytic processing. Many vendors have provided interfaces between SQL systems and Hadoop but have not been able to semantically integrate these technologies while Hive, Pig and SQL processing islands proliferate. This session will discuss how Teradata is working with Hortonworks to optimize the use of Hadoop within the Teradata Analytical Ecosystem to ingest, store, and refine new data types, as well as exciting new developments to bridge the gap between Hadoop and SQL to unlock deeper insights from data in Hadoop. The use of Teradata Aster as a tightly integrated SQL-MapReduce® Discovery Platform for Hadoop environments will also be discussed.
TM Forum Webinar - Telco API-driven digital marketplace opportunities | Post-... (ShubaS4)
If you missed the live webinar, you can catch all the details in this presentation. Expert speakers Karthik TS and Dean Ramsay discussed CSP strategies for a new breed of marketplaces in this on-demand webinar. This slide deck provides a comprehensive overview of the live webinar and is a great resource for CSPs looking for out-of-the-box API-driven digital marketplace solutions.
Informatica provides the market's leading data integration platform. Tested on nearly 500,000 combinations of platforms and applications, the platform interoperates with the broadest possible range of disparate standards, systems, and applications. This unbiased and universal view makes Informatica unique in today's market as the leader in data integration platforms. It also makes Informatica the ideal strategic platform for companies looking to solve data integration issues of any size.
By Zhamak Dehghani
Principal Consultant
ThoughtWorks
https://martinfowler.com/articles/data-monolith-to-mesh.html
How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh
Many enterprises are investing in their next-generation data lake in the hope of democratizing data at scale to provide business insights and, ultimately, make automated intelligent decisions. Data platforms based on the data lake architecture have common failure modes that lead to unfulfilled promises at scale. To address these failure modes, we need to shift from the centralized paradigm of a lake, or its predecessor the data warehouse, to a paradigm that draws from modern distributed architecture: considering domains as the first-class concern, applying platform thinking to create self-serve data infrastructure, and treating data as a product.
Composable data for the composable enterprise (Matt McLarty)
I gave this talk at API Days Australia on September 15, 2021. It explores the intersection of the OLTP and OLAP worlds, and the role APIs play in bridging them. This talk introduces API-led Data Connectivity (ALDC).
Tomer Shiran is the founder and Chief Product Officer (CPO) of Dremio. Tomer was the fourth employee and VP of Product at MapR, a pioneer in Big Data analytics. He also held numerous product management and engineering positions at IBM Research and Microsoft, and founded several websites that served millions of users. He holds a Master's in computer engineering from Carnegie Mellon University and a Bachelor of Science in computer science from the Technion - Israel Institute of Technology.
The Modern Data Stack meetup is delighted to welcome Tomer Shiran. From Apache Drill and Apache Arrow to, now, Apache Iceberg, he and his teams have anchored Dremio's choices in a vision of an "open" data platform built on open source technologies. Beyond these values, which keep customers from being locked into proprietary formats, he is also mindful of the costs such platforms generate. He also delivers features that transform data management, through initiatives such as Nessie, which opens the road to Data as Code and multi-process transactions.
The Modern Data Stack Meetup gives Tomer Shiran "carte blanche" to share his experience and his vision of the Open Data Lakehouse.
Bighead: Airbnb’s End-to-End Machine Learning Platform with Krishna Puttaswa... (Databricks)
Airbnb has a wide variety of ML problems ranging from models on traditional structured data to models built on unstructured data such as user reviews, messages and listing images. The ability to build, iterate on, and maintain healthy machine learning models is critical to Airbnb’s success. Many ML Platforms cover data collection, feature engineering, training, deploying, productionalization, and monitoring but few, if any, do all of the above seamlessly.
Bighead aims to tie together various open source and in-house projects to remove incidental complexity from ML workflows. Bighead is built on Python and Spark and can be used in modular pieces as each ML problem presents unique challenges. Through standardization of the path to production, training environments and the methods for collecting and transforming data on Spark, each model is reproducible and iterable.
This talk covers the architecture, the problems that each individual component and the overall system aim to solve, and a vision for the future of machine learning infrastructure. Bighead is widely adopted at Airbnb, and we have a variety of models running in production. We have seen overall model development time go down from many months to days on Bighead. We plan to open source Bighead to allow the wider community to benefit from our work.
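As a loose, hypothetical sketch of that "modular pieces" idea (this is not Bighead's actual API; the Step and Pipeline names here are invented for illustration), a standardized pipeline can make every run follow the same reproducible path:

```python
# Hypothetical sketch, not Bighead's API: swappable steps compose into a
# standardized pipeline, so each run follows the same reproducible path.
from dataclasses import dataclass
from typing import Callable, Sequence

@dataclass(frozen=True)
class Step:
    name: str
    fn: Callable[[list[float]], list[float]]

@dataclass(frozen=True)
class Pipeline:
    steps: Sequence[Step]

    def run(self, data: list[float]) -> list[float]:
        # Each ML problem plugs in only the pieces it needs.
        for step in self.steps:
            data = step.fn(data)
        return data

normalize = Step("normalize", lambda xs: [x / max(xs) for x in xs])
clip = Step("clip", lambda xs: [min(x, 0.9) for x in xs])

pipeline = Pipeline([normalize, clip])
print(pipeline.run([2.0, 4.0, 8.0]))  # [0.25, 0.5, 0.9]
```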
Data Lake Architecture – Modern Strategies & Approaches (DATAVERSITY)
Data Lake or Data Swamp? By now, we’ve likely all heard the comparison. Data Lake architectures have the opportunity to provide the ability to integrate vast amounts of disparate data across the organization for strategic business analytic value. But without a proper architecture and metadata management strategy in place, a Data Lake can quickly devolve into a swamp of information that is difficult to understand. This webinar will offer practical strategies to architect and manage your Data Lake in a way that optimizes its success.
The data architecture of solutions is frequently not given the attention it deserves or needs: too little attention is paid to designing and specifying the data architecture within individual solutions and their constituent components. This is due to the behaviours of both solution architects and data architects.
Solution architecture tends to concern itself with the functional, technology, and software components of the solution, and frequently omits the detail of the data aspects, leaving a solution data architecture gap. Data architecture, for its part, tends not to get involved with the data aspects of technology solutions, leaving a data architecture gap. Together, these gaps result in a data blind spot for the organisation.
Data architecture also tends to concern itself with data only after individual solutions are delivered. It needs to shift left into the domain of solutions and their data, engaging more actively with the data dimensions of individual solutions. Data architecture can take the lead in closing these gaps by shifting its scope and activities left, and by providing standards and common data tooling for solution data architecture.
The objective of data design for solutions is the same as that for overall solution design:
• To capture sufficient information to enable the solution design to be implemented
• To unambiguously define the data requirements of the solution and to confirm and agree those requirements with the target solution consumers
• To ensure that the implemented solution meets the requirements of the solution consumers and that no deviations have taken place during the solution implementation journey
Good solution data architecture avoids problems with solution operation and use, such as:
• Poor and inconsistent data quality
• Poor performance, throughput, response times and scalability
• Poorly designed data structures that cause long update and response times, hurting solution usability and productivity and driving transaction abandonment
• Poor reporting and analysis
• Poor data integration
• Poor solution serviceability and maintainability
• Manual workarounds for data integration, data extract for reporting and analysis
Data-design-related solution problems frequently become evident only after the solution goes live, so the benefits of solution data architecture are not always evident initially.
This is Part 3 of the series on Data Mesh, looking at the intersection of microservices architecture concepts, data integration/replication technologies, and log-based stream integration techniques. This webinar was mostly a demonstration, but several slides used to set up the demo are included here as a PDF for viewers.
Unleashing the Power of OpenAI GPT-3 in FME Data Integration Workflows (Safe Software)
Join us for an eye-opening webinar where we will demonstrate the incredible power and productivity of OpenAI GPT-3 in FME data integration scenarios. From natural language processing to automated workflow generation and predictive modeling, we will show how GPT-3 can tackle even the most complex and daunting data integration challenges without a single line of code. This is a must-attend event for anyone looking to unlock the full potential of their data and streamline their integration workflows. Be prepared to be amazed and astounded at the capabilities of GPT-3 and FME!
Data Architecture - The Foundation for Enterprise Architecture and Governance (DATAVERSITY)
Organizations are faced with an increasingly complex data landscape, finding themselves unable to cope with exponentially increasing data volumes, compounded by additional regulatory requirements with increased fines for non-compliance. Enterprise architecture and data governance are often discussed at length, but often with different stakeholder audiences. This can result in complementary and sometimes conflicting initiatives rather than a focused, integrated approach. Data governance requires a solid data architecture foundation in order to support the pillars of enterprise architecture. In this session, IDERA’s Ron Huizenga will discuss a practical, integrated approach to effectively understand, define, and implement a cohesive enterprise architecture and data governance discipline with integrated modeling and metadata management.
FinOps Data - FR - by Matthieu Rousseau & Ismael Goulani
Matthieu Rousseau, CEO & Data Engineer, Modeo.
Ismael Goulani, CTO & Data Engineer, Modeo.
A look back at the first prize in the "Innovative Solution" category of the #LaNuitdelaData challenge, won with their solution Stach, a platform that helps data teams better understand how data is used by consumers, what it costs, and its carbon impact.
Slides for the talk at AI in Production meetup:
https://www.meetup.com/LearnDataScience/events/255723555/
Abstract: Demystifying Data Engineering
With recent progress in the fields of big data analytics and machine learning, Data Engineering is an emerging discipline which is not well-defined and often poorly understood.
In this talk, we aim to explain Data Engineering, its role in Data Science, the difference between a Data Scientist and a Data Engineer, the role of a Data Engineer and common concepts as well as commonly misunderstood ones found in Data Engineering. Toward the end of the talk, we will examine a typical Data Analytics system architecture.
Dremio: a simple, high-performance architecture for your data lakehouse.
In the data world, Dremio defies classification! It is at once a data delivery platform, a powerful SQL engine built on Apache Arrow, Apache Calcite, and Apache Parquet, an active data catalog, and an open Data Lakehouse! After getting acquainted with the platform, we will look at how Dremio helps organizations meet their data management and governance challenges, making it easier to run their analytics in the cloud (and/or on premises) without the cost, complexity, and lock-in of data warehouses.
This is Part 4 of the GoldenGate series on Data Mesh - a series of webinars helping customers understand how to move off of old-fashioned monolithic data integration architecture and get ready for more agile, cost-effective, event-driven solutions. The Data Mesh is a kind of Data Fabric that emphasizes business-led data products running on event-driven streaming architectures, serverless, and microservices-based platforms. These emerging solutions are essential for enterprises that run data-driven services on multi-cloud, multi-vendor ecosystems.
Join this session to get a fresh look at Data Mesh; we'll start with core architecture principles (vendor agnostic) and transition into detailed examples of how Oracle's GoldenGate platform is providing capabilities today. We will discuss essential technical characteristics of a Data Mesh solution, and the benefits that business owners can expect by moving IT in this direction. For more background on Data Mesh, Part 1, 2, and 3 are on the GoldenGate YouTube channel: https://www.youtube.com/playlist?list=PLbqmhpwYrlZJ-583p3KQGDAd6038i1ywe
Webinar Speaker: Jeff Pollock, VP Product (https://www.linkedin.com/in/jtpollock/)
Mr. Pollock is an expert technology leader for data platforms, big data, data integration and governance. Jeff has been CTO at California startups and a senior exec at Fortune 100 tech vendors. He is currently Oracle VP of Products and Cloud Services for Data Replication, Streaming Data and Database Migrations. While at IBM, he was head of all Information Integration, Replication and Governance products; previously, Jeff was an independent architect for the US Defense Department, VP of Technology at Cerebra, and CTO of Modulant. He has been engineering artificial intelligence-based data platforms since 2001. As a business consultant, Mr. Pollock was a Head Architect at Ernst & Young’s Center for Technology Enablement. Jeff is also the author of “Semantic Web for Dummies” and “Adaptive Information,” a frequent keynote speaker at industry conferences, an author for books and industry journals, formerly a contributing member of W3C and OASIS, and an engineering instructor with UC Berkeley’s Extension for object-oriented systems, software development process, and enterprise architecture.
Building End-to-End Delta Pipelines on GCP (Databricks)
Delta has been powering many production pipelines at scale in the Data and AI space since it was introduced a few years ago.
Built on open standards, Delta provides data reliability and enhances storage and query performance to support big data use cases (both batch and streaming), fast interactive queries for BI, and machine learning. Delta has matured over the past couple of years on both AWS and Azure and has become the de facto standard for organizations building their Data and AI pipelines.
In today’s talk, we will explore building end-to-end pipelines on the Google Cloud Platform (GCP). Through presentation, code examples, and notebooks, we will build the Delta pipeline from ingest to consumption using our Delta Bronze-Silver-Gold architecture pattern and show examples of consuming the Delta files using the BigQuery connector.
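As a hedged illustration of that medallion pattern (the GCS paths, column names, and configuration below are assumptions, not the session's actual notebooks), a minimal PySpark sketch might look like this:

```python
# A minimal sketch of a Bronze-Silver-Gold Delta flow, assuming a Spark
# environment with the delta-spark package and hypothetical GCS paths.
from pyspark.sql import SparkSession, functions as F

spark = (
    SparkSession.builder.appName("delta-medallion-sketch")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Bronze: land raw events as-is.
raw = spark.read.json("gs://example-bucket/raw/events/")  # hypothetical path
raw.write.format("delta").mode("append").save("gs://example-bucket/bronze/events")

# Silver: cleanse and deduplicate.
bronze = spark.read.format("delta").load("gs://example-bucket/bronze/events")
silver = bronze.dropDuplicates(["event_id"]).filter(F.col("event_ts").isNotNull())
silver.write.format("delta").mode("overwrite").save("gs://example-bucket/silver/events")

# Gold: aggregate for consumption (e.g., read externally via a BigQuery reader).
gold = silver.groupBy("event_date").agg(F.count("*").alias("events"))
gold.write.format("delta").mode("overwrite").save("gs://example-bucket/gold/daily_counts")
```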
Self-service analytics @ Leaseplan Digital: from business intelligence to int... (webwinkelvakdag)
Our mission is to drive digital data intelligence in a 55-year-old company currently undergoing digital transformation.
We do this through cloud big data architecture and intuitive business performance visualizations based on multiple data sources across customer journeys. Join this session to find out how we are enabling enterprise-wide adoption of self-service analytics, both internally as the single source of truth for business performance and as an embedded analytics solution that lets end customers steer real-time vehicle maintenance through predictive models.
In this session we will share our challenges, learnings, achievements and roadmap to embed self-service analytics in LeasePlan.
The Importance of DataOps in a Multi-Cloud World (DATAVERSITY)
There’s no denying that Cloud has evolved from being an outlying market disruptor to a mainstream method for delivering IT applications and services. In fact, it’s not uncommon to find that Enterprises use the services of more than one cloud at the same time. However, while a multi-cloud strategy offers many benefits, it also increases data management complexity and consequently reduces data availability. This webinar defines the meaning of DataOps and why it’s a crucial component for every multi-cloud approach.
Streamline Data Governance with Egeria: The Industry's First Open Metadata St... (DataWorks Summit)
Learn about the industry's new open metadata standard, Egeria, introduced in September by ODPi, The Linux Foundation’s Open Data Platform initiative. Egeria supports the free flow of standardized metadata between different technologies and vendor platforms, enabling organizations to locate, manage, and use their data resources more effectively. Explore how Egeria's set of open APIs, types, and interchange protocols allows all metadata repositories to share and exchange metadata. From this common base, it adds governance, discovery, and access frameworks for automating the collection, management, and use of metadata across an enterprise. The result is an enterprise catalog of data resources that are transparently assessed, governed, and used in order to deliver maximum value to the enterprise.
This presentation by ODPi Director John Mertic provides an introduction to Egeria and explores how the standard provides a vendor-neutral approach to data governance. Learn how a group of companies led by ING, IBM, and Hortonworks came together through the open source community to re-imagine data governance and deliver Egeria, automating the collection, management, and use of metadata across organizations of any size and complexity. Learn how Egeria was built on open standards and delivered via the Apache 2.0 open source license.
Best Practices for the Hadoop Data Warehouse: EDW 101 for Hadoop Professionals (Cloudera, Inc.)
The enormous legacy of EDW experience and best practices can be adapted to the unique capabilities of the Hadoop environment. In this webinar, in a point-counterpoint format, Dr. Kimball will describe standard data warehouse best practices including the identification of dimensions and facts, managing primary keys, and handling slowly changing dimensions (SCDs) and conformed dimensions. Eli Collins, Chief Technologist at Cloudera, will describe how each of these practices actually can be implemented in Hadoop.
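As one hedged illustration of adapting such practices to Hadoop (the table and column names are assumptions, and this is not necessarily how the speakers implement it), a Type 2 slowly changing dimension merge can be sketched in PySpark:

```python
# A minimal sketch of a Type 2 SCD merge on Hadoop, under assumed table and
# column names: expire changed rows, append fresh versions, keep the rest.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("scd2-sketch").enableHiveSupport().getOrCreate()

dim = spark.table("dw.customer_dim")       # customer_id, address, valid_from, valid_to, is_current
upd = spark.table("staging.customer_upd")  # customer_id, address, load_date

# Pair current dimension rows with incoming records whose tracked attribute changed.
joined = (dim.alias("d")
             .join(upd.alias("u"), F.col("d.customer_id") == F.col("u.customer_id"))
             .where(F.col("d.is_current") & (F.col("d.address") != F.col("u.address"))))

# Close out the superseded versions.
expired = joined.select(
    F.col("d.customer_id").alias("customer_id"),
    F.col("d.address").alias("address"),
    F.col("d.valid_from").alias("valid_from"),
    F.col("u.load_date").alias("valid_to"),
    F.lit(False).alias("is_current"),
)

# Open new current versions carrying the changed attribute.
fresh = joined.select(
    F.col("u.customer_id").alias("customer_id"),
    F.col("u.address").alias("address"),
    F.col("u.load_date").alias("valid_from"),
    F.lit(None).cast("date").alias("valid_to"),
    F.lit(True).alias("is_current"),
)

# Keep every dimension row that was not superseded, then union the new history.
untouched = dim.join(expired.select("customer_id", "valid_from"),
                     ["customer_id", "valid_from"], "left_anti")
result = untouched.unionByName(expired).unionByName(fresh)
result.write.mode("overwrite").saveAsTable("dw.customer_dim_v2")
```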
Presentation at Data Summit 2015 in NYC.
Elliott Cordo shared real-world insights across a range of topics, including the evolving best practices for building a data warehouse on Hadoop that also coexists with multiple processing frameworks and additional non-Hadoop storage platforms, the place for massively parallel-processing and relational databases in analytic architectures, and the ways in which the cloud offers the ability to quickly and cost-effectively establish a scalable platform for your Big Data warehouse.
For more information, visit www.casertaconcepts.com
"Hadoop and Data Warehouse (DWH) – Friends, Enemies or Profiteers? What about...Kai Wähner
I discuss a sound big data architecture that combines Data Warehouse / Business Intelligence, Apache Hadoop, and Real Time / Stream Processing. Several real-world examples are shown. TIBCO offers some very nice products for realizing these use cases, e.g. Spotfire (Business Intelligence / BI), StreamBase (Stream Processing), BusinessEvents (Complex Event Processing / CEP), and BusinessWorks (Integration / ESB). TIBCO is also ready for Hadoop, offering connectors and plugins for many important Hadoop frameworks and interfaces such as HDFS, Pig, Hive, Impala, Apache Flume, and more.
In this presentation, Scott Gnau from Teradata Labs presents: Teradata Intelligent Memory.
“The introduction of Teradata Intelligent Memory allows our customers to exploit the performance of memory within Teradata Platforms, which extends our leadership position as the best performing data warehouse technology at the most competitive price,” said Scott Gnau, president, Teradata Labs. “Teradata Intelligent Memory technology is built into the data warehouse and customers don’t have to buy a separate appliance. Additionally, Teradata enables its customers to buy and configure the exact amount of in-memory capability needed for critical workloads. It is unnecessary and impractical to keep all data in memory, because all data do not have the same value to justify being placed in expensive memory.”
Hadoop World 2011: Extending Enterprise Data Warehouse with Hadoop - Jonathan... (Cloudera, Inc.)
Hadoop provides the ability to extract business intelligence from extremely large, heterogeneous data sets that were previously impractical to store and process in traditional data warehouses. The challenge now is in bridging the gap between the data warehouse and Hadoop. In this talk we’ll discuss some steps that Orbitz has taken to bridge this gap, including examples of how Hadoop and Hive are used to aggregate data from large data sets, and how that data can be combined with relational data to create new reports that provide actionable intelligence to business users.
BSI Teradata: The Shocking Case of Home Electronics Planet (Teradata)
Home Electronics Planet, a big-box retailer, has digital marketing campaigns that are failing. Their Chief Marketing Officer gets some analytics and data science help from Business Scenario Investigators who recommend changing their search keywords mix, creating tighter customer segments based on product purchase sequencing coupled with real-time web page personalizations, and revising their e-mail marketing to improve business results.
This talk was given by Marcel Kornacker at the 11th meeting, on April 7, 2014.
Impala (impala.io) raises the bar for SQL query performance on Apache Hadoop. With Impala, you can query Hadoop data – including SELECT, JOIN, and aggregate functions – in real time to do BI-style analysis. As a result, Impala makes a Hadoop-based enterprise data hub function like an enterprise data warehouse for native Big Data.
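For a flavor of such BI-style analysis from Python, here is a minimal sketch using the impyla client; the host, port, and table names are illustrative assumptions:

```python
# A minimal sketch of an interactive BI-style query against Impala via impyla;
# the daemon host, port, and schema are assumptions for illustration.
from impala.dbapi import connect

conn = connect(host="impala-daemon.example.com", port=21050)
cur = conn.cursor()

# A typical interactive query: join fact and dimension tables, then aggregate.
cur.execute("""
    SELECT d.region,
           COUNT(*)      AS orders,
           SUM(f.amount) AS revenue
    FROM   sales_fact f
    JOIN   store_dim  d ON f.store_id = d.store_id
    WHERE  f.sale_date >= '2014-01-01'
    GROUP  BY d.region
    ORDER  BY revenue DESC
""")
for region, orders, revenue in cur.fetchall():
    print(region, orders, revenue)
cur.close()
conn.close()
```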
Hadoop Integration into Data Warehousing Architectures (Humza Naseer)
This presentation explains research work on the topic of 'Hadoop integration into data warehouse architectures'. It explains where Hadoop fits into data warehouse architecture. Furthermore, it proposes a BI assessment model to determine the capability of a current BI program and how to define a roadmap for its maturity.
Complement Your Existing Data Warehouse with Big Data & Hadoop (Datameer)
To view the full webinar, please go to: http://info.datameer.com/Slideshare-Complement-Your-Existing-EDW-with-Hadoop-OnDemand.html
With 40% yearly growth in data volumes, traditional data warehouses have become increasingly expensive and challenging.
Many of today’s new data sources are unstructured, making the structured data warehouse an unsuitable platform for these analyses. As a result, organizations now look at Hadoop as a data platform to complement existing BI data warehouses, and as a scalable, flexible, and cost-effective solution for data storage and analysis.
Join Datameer and Cloudera in this webinar to discuss how Hadoop and big data analytics can help to:
-Get all the data your business needs quickly into one environment
-Shorten the time to insight from months to days
-Extend the life of your existing data warehouse investments
-Enable your business analysts to ask and answer bigger questions
Build a Big Data Warehouse on the Cloud in 30 Minutes (Caserta)
Elliott Cordo, Chief Architect at Caserta Concepts, will give a live demo using Amazon's AWS to build a Big Data Warehouse using S3 for data storage, Elastic MapReduce (EMR) for data manipulation, and Redshift for interactive queries; a rough sketch of the pattern follows the link below.
For more information, visit http://www.casertaconcepts.com/.
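Here is a rough sketch of that S3 / EMR / Redshift pattern, assuming hypothetical bucket and table names; it is an outline of the approach, not the demo's actual code:

```python
# Sketch of the pattern: transform raw files with Spark on EMR, write the
# result back to S3, then load it into Redshift with COPY for fast queries.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("s3-emr-redshift-sketch").getOrCreate()

# Transform raw clickstream JSON on EMR and land curated CSV back in S3.
clicks = spark.read.json("s3://example-dw/raw/clicks/")
daily = clicks.groupBy(F.to_date("ts").alias("day")).count()
daily.write.mode("overwrite").csv("s3://example-dw/curated/daily_clicks/")

# Load the curated files into Redshift (run from any SQL client; the IAM
# role ARN is a placeholder).
COPY_SQL = """
COPY analytics.daily_clicks
FROM 's3://example-dw/curated/daily_clicks/'
IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy'
DELIMITER ',';
"""
print(COPY_SQL)
```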
The Intelligent Thing -- Using In-Memory for Big Data and Beyond (Inside Analysis)
The Briefing Room with John O'Brien and Teradata
Live Webcast on June 11, 2013
http://www.insideanalysis.com
For traditional Data Warehousing and Big Data Analytics, research shows that a small percentage of enterprise data often comprises the lion's share of what's needed for queries. That's hot data, and organizations that know how to effectively harness that data can stay on top of what's happening. Conversely, cold data can certainly provide value at times, but should ideally be stored in ways that minimize cost. The more dynamically a company can manage this hot and cold data, the more efficient its information systems become.
Register for this episode of The Briefing Room to hear veteran database expert John O'Brien of Radiant Advisors as he outlines a strategy for managing hot and cold data. He'll be briefed by Alan Greenspan of Teradata, who will tout his company's Intelligent In-Memory solution, which optimizes the management of hot and cold data to keep analysts fueled with the data they need most. He'll also discuss Teradata Virtual Storage, which helps optimize the storage and provisioning of information assets.
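As a generic sketch of the hot/cold idea (a toy heuristic, not Teradata Intelligent Memory's actual placement logic), data temperature can be modeled as access frequency driving tier placement:

```python
# Toy sketch: route frequently accessed blocks to memory, cold blocks to
# cheaper storage. The threshold and block names are illustrative.
from collections import Counter

access_counts: Counter[str] = Counter()

def record_access(block_id: str) -> None:
    access_counts[block_id] += 1

def tier_for(block_id: str, hot_threshold: int = 100) -> str:
    """Place hot blocks in memory and the rest on disk."""
    return "memory" if access_counts[block_id] >= hot_threshold else "disk"

for _ in range(150):
    record_access("sales_2013_q2")   # a hot, recent table
record_access("sales_2009_q1")       # a cold, historical table

print(tier_for("sales_2013_q2"))  # memory
print(tier_for("sales_2009_q1"))  # disk
```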
Hadoop in 2015: Keys to Achieving Operational Excellence for the Real-Time En... (MapR Technologies)
In this webinar, Carl W. Olofson, Research Vice President, Application Development and Deployment for IDC, and Dale Kim, Director of Industry Solutions for MapR, will provide an insightful outlook for Hadoop in 2015, and will outline why enterprises should consider using Hadoop as a "Decision Data Platform" and how it can function as a single platform for both online transaction processing (OLTP) and real-time analytics.
Creating a Next-Generation Big Data Architecture (Perficient, Inc.)
If you’ve spent time investigating Big Data, you quickly realize that the issues surrounding it are often complex to analyze and solve. The sheer volume, velocity, and variety change the way we think about data, including how enterprises approach data architecture.
Significant reduction in costs for processing, managing, and storing data, combined with the need for business agility and analytics, requires CIOs and enterprise architects to rethink their enterprise data architecture and develop a next-generation approach to solve the complexities of Big Data.
Creating the data architecture while integrating Big Data into the heart of the enterprise data architecture is a challenge. This webinar covered:
-Why Big Data capabilities must be strategically integrated into an enterprise’s data architecture
-How a next-generation architecture can be conceptualized
-The key components to a robust next generation architecture
-How to incrementally transition to a next generation data architecture
OPEN'17_4_Postgres: The Centerpiece for Modernising IT Infrastructures (Kangaroot)
Postgres is the leading open source database management system, developed by a very active community for more than 15 years.
Gaby Schilders, Sales Engineer at EnterpriseDB, supplier of the EDB Postgres data platform, will explain why companies make open source the centerpiece of modernising their IT infrastructure, increasing their scalability and taking full advantage of what today's technologies offer.
Transform your DBMS to drive engagement innovation with Big Data (Ashnikbiz)
Erik Baardse and Ajit Gadge from EDB Postgres presented on how to transform your DBMS to drive digital business: how Postgres enables you to support a wider range of workloads with your relational database, opening the door to Big Data. They also covered EnterpriseDB's strategy around Big Data, which focuses on three areas, and finally how to find money in IT with Big Data and digital transformation.
The ability to effectively analyze this kind of information is now seen as a key competitive advantage to better inform decisions. To do so, organizations employ Sentiment Analysis (SA) techniques on these data. However, the usage of social media around the world is ever-increasing, which considerably accelerates massive data generation and makes traditional SA systems unable to deliver useful insights. Such a volume of data can be efficiently analyzed using the combination of SA techniques and Big Data technologies. In fact, big data is not a luxury but an essential necessity for making valuable predictions. However, there are challenges associated with big data, such as quality, that can strongly affect the accuracy of SA systems built on huge volumes of data. The quality aspect should therefore be addressed in order to build reliable and credible systems. The goal of our research is thus to consider Big Data Quality Metrics (BDQM) in SA that relies on big data. In this paper, we first highlight the most relevant BDQM that should be considered throughout the Big Data Value Chain (BDVC) in any big data project. Then, we measure the impact of BDQM on the accuracy of a novel SA method in a real case study, presenting simulation results.
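As a toy sketch of how one such quality metric, completeness, might gate an SA pipeline (the fields and threshold below are assumptions for illustration, not the paper's method):

```python
# Toy sketch: measure record completeness before sentiment analysis, and gate
# the pipeline on it. Fields and the 0.8 threshold are assumptions.
from typing import Optional, TypedDict

class Post(TypedDict):
    text: Optional[str]
    lang: Optional[str]

def completeness(records: list[Post], fields=("text", "lang")) -> float:
    """Fraction of records where every required field is present and non-empty."""
    if not records:
        return 0.0
    ok = sum(all(r.get(f) for f in fields) for r in records)
    return ok / len(records)

posts: list[Post] = [
    {"text": "love this phone", "lang": "en"},
    {"text": None, "lang": "en"},              # incomplete record
    {"text": "terrible battery", "lang": None},
]

score = completeness(posts)
if score < 0.8:  # quality gate before running the SA model
    print(f"completeness {score:.2f}: clean or re-ingest before sentiment analysis")
```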
Hitachi Data Systems Hadoop Solution. Customers are seeing exponential growth of unstructured data, from their social media websites to operational sources. Their enterprise data warehouses are not designed to handle such high volumes and varieties of data. Hadoop, the latest software platform that scales to process massive volumes of unstructured and semi-structured data by distributing the workload across clusters of servers, gives customers a new option to tackle data growth and deploy big data analysis to help better understand their business. Hitachi Data Systems is launching its latest Hadoop reference architecture, which is pre-tested with the Cloudera Hadoop distribution to provide a faster time to market for customers deploying Hadoop applications. HDS, Cloudera, and Hitachi Consulting will present together and explain how to get you there. Attend this WebTech and learn how to:
-Solve big data problems with Hadoop
-Deploy Hadoop in your data warehouse environment to better manage your unstructured and structured data
-Implement Hadoop using the HDS Hadoop reference architecture
For more information on the Hitachi Data Systems Hadoop Solution, please read our blog: http://blogs.hds.com/hdsblog/2012/07/a-series-on-hadoop-architecture.html
Data Lake Acceleration vs. Data Virtualization - What’s the difference? (Denodo)
Watch full webinar here: https://bit.ly/3hgOSwm
Data Lake technologies have been in constant evolution in recent years, with each iteration promising to fix what previous ones failed to accomplish. Several data lake engines are hitting the market with better ingestion, governance, and acceleration capabilities that aim to create the ultimate data repository. But isn't that the promise of a logical architecture with data virtualization too? So, what’s the difference between the two technologies? Are they friends or foes? This session will explore the details.
5 Things that Make Hadoop a Game Changer
Webinar by Elliott Cordo, Caserta Concepts
There is much hype and mystery surrounding Hadoop's role in analytic architecture. In this webinar, Elliott presented, in detail, the services and concepts that make Hadoop a truly unique solution: a game changer for the enterprise. He talked about the real benefits of a distributed file system, the multi-workload processing capabilities enabled by YARN, and the three other important things you need to know about Hadoop.
To access the recorded webinar, visit the event site: https://www.brighttalk.com/webcast/9061/131029
For more information on the services and solutions that Caserta Concepts offers, please visit http://casertaconcepts.com/
Similar to Hadoop and the Data Warehouse: When to Use Which
Introduction: This workshop will provide a hands-on introduction to Machine Learning (ML) with an overview of Deep Learning (DL).
Format: An introductory lecture on several supervised and unsupervised ML techniques, followed by a light introduction to DL and a short discussion of the current state of the art. Several Python code samples using the scikit-learn library will be introduced that users will be able to run in the Cloudera Data Science Workbench (CDSW).
Objective: To provide a short hands-on introduction to ML with Python's scikit-learn library. The environment in CDSW is interactive, and the step-by-step guide will walk you through setting up your environment, exploring datasets, and training and evaluating models on popular datasets. By the end of the crash course, attendees will have a high-level understanding of popular ML algorithms and the current state of DL, what problems they can solve, and walk away with basic hands-on experience training and evaluating ML models.
Prerequisites: For the hands-on portion, registrants must bring a laptop with a Chrome or Firefox web browser. These labs will be done in the cloud; no installation is needed. Everyone will be able to register and start using CDSW after the introductory lecture concludes (about 1 hr in). Basic knowledge of Python is highly recommended.
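As a preview of the kind of exercise described above, a minimal scikit-learn train-and-evaluate example on a built-in dataset might look like this (the model choice and split are illustrative, not the workshop's exact labs):

```python
# A minimal scikit-learn workflow: load a popular dataset, split it, train a
# classifier, and evaluate accuracy on held-out data.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

preds = model.predict(X_test)
print(f"test accuracy: {accuracy_score(y_test, preds):.3f}")
```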
Floating on a RAFT: HBase Durability with Apache Ratis (DataWorks Summit)
In a world with a myriad of distributed storage systems to choose from, the majority of Apache HBase clusters still rely on Apache HDFS. Theoretically, any distributed file system could be used by HBase. One major reason HDFS is predominantly used is the specific durability requirements of HBase's write-ahead log (WAL), which HDFS guarantees correctly. However, HBase's use of HDFS for WALs can be replaced with sufficient effort.
This talk will cover the design of a "Log Service" which can be embedded inside of HBase that provides a sufficient level of durability that HBase requires for WALs. Apache Ratis (incubating) is a library-implementation of the RAFT consensus protocol in Java and is used to build this Log Service. We will cover the design choices of the Ratis Log Service, comparing and contrasting it to other log-based systems that exist today. Next, we'll cover how the Log Service "fits" into HBase and the necessary changes to HBase which enable this. Finally, we'll discuss how the Log Service can simplify the operational burden of HBase.
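To make the WAL contract concrete, here is a conceptual Python sketch of the interface such a Log Service must honor; it is an illustration only, not the Ratis Log Service API:

```python
# Conceptual sketch of the WAL contract: appends are acknowledged only once
# durable, and recovery replays durable edits in order. Not the Ratis API.
from dataclasses import dataclass, field
from typing import Iterator

@dataclass
class LogService:
    _entries: list[bytes] = field(default_factory=list)
    _synced: int = 0  # index up to which entries are durable

    def append(self, edit: bytes) -> int:
        """Append a WAL edit; returns its sequence id."""
        self._entries.append(edit)
        return len(self._entries) - 1

    def sync(self) -> None:
        """Block until all appended edits are durable. In a RAFT-based log
        service this means replicated to a quorum of peers, rather than
        flushed to HDFS."""
        self._synced = len(self._entries)

    def replay(self) -> Iterator[bytes]:
        """Yield durable edits in order for region recovery."""
        yield from self._entries[: self._synced]

wal = LogService()
wal.append(b"put row1 cf:col=v1")
wal.sync()  # HBase acks the client only after durability is confirmed
for edit in wal.replay():
    print(edit)
```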
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi (DataWorks Summit)
Utilizing Apache NiFi, we read various open data REST APIs and camera feeds to ingest crime and related data, streaming it in real time into HBase and Phoenix tables. HBase makes an excellent storage option for our real-time time series data sources. We can immediately query our data utilizing Apache Zeppelin against Phoenix tables, as well as Hive external tables over HBase.
Apache Phoenix tables also make a great option, since we can easily put microservices on top of them for application usage. I have an example Spring Boot application that reads from our Philadelphia crime table for front-end web applications as well as RESTful APIs.
Apache NiFi makes it easy to push records with schemas to HBase and insert them into Phoenix SQL tables; a small query sketch follows the resource links below.
Resources:
https://community.hortonworks.com/articles/54947/reading-opendata-json-and-storing-into-phoenix-tab.html
https://community.hortonworks.com/articles/56642/creating-a-spring-boot-java-8-microservice-to-read.html
https://community.hortonworks.com/articles/64122/incrementally-streaming-rdbms-data-to-your-hadoop.html
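As a small hedged sketch of querying such a Phoenix table from Python, using the phoenixdb client through the Phoenix Query Server (the server URL and table schema are assumptions, not the talk's actual setup):

```python
# Minimal sketch: Phoenix speaks SQL over HBase, so BI-style queries work
# directly. The query server URL and crime-table columns are assumptions.
import phoenixdb

conn = phoenixdb.connect("http://phoenix-query-server:8765/", autocommit=True)
cur = conn.cursor()

# Count recent incidents per police district.
cur.execute(
    "SELECT dc_dist, COUNT(*) AS incidents "
    "FROM philly_crime "
    "WHERE dispatch_date >= ? "
    "GROUP BY dc_dist ORDER BY incidents DESC",
    ["2019-01-01"],
)
for district, incidents in cur.fetchall():
    print(district, incidents)
conn.close()
```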
HBase Tales From the Trenches - Short stories about most common HBase operati... (DataWorks Summit)
Whilst HBase is the most logical answer for use cases requiring random, real-time read/write access to Big Data, it is not trivial to design applications that make the most of it, nor is it the simplest system to operate. Because it depends on and integrates with other components from the Hadoop ecosystem (ZooKeeper, HDFS, Spark, Hive, etc.) or external systems (Kerberos, LDAP), and its distributed nature requires a "Swiss clockwork" infrastructure, many variables must be considered when investigating anomalies or even outages. Adding to the equation, HBase is still an evolving product, with different release versions in use today, some of which carry genuine software bugs. In this presentation, we'll go through the most common HBase issues faced by different organisations, describing the identified cause and the resolution action, drawn from my last 5 years supporting HBase for our heterogeneous customer base.
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...DataWorks Summit
LocationTech GeoMesa enables spatial and spatiotemporal indexing and queries for HBase and Accumulo. In this talk, after an overview of GeoMesa’s capabilities in the Cloudera ecosystem, we will dive into how GeoMesa leverages Accumulo’s Iterator interface and HBase’s Filter and Coprocessor interfaces. The goal will be to discuss both what spatial operations can be pushed down into the distributed database and also how the GeoMesa codebase is organized to allow for consistent use across the two database systems.
OCLC has been using HBase since 2012 to enable single-search-box access to over a billion items from your library and the world's library collection. This talk will provide an overview of how HBase is structured to provide this information, some of the challenges encountered in scaling to support the world catalog, and how they have been overcome.
Many individuals and organizations want to adopt NoSQL technology but often lack an understanding of how its underlying functional bits can be applied to their use case. This situation can result in a drastically increased desire to put the SQL back in NoSQL.
Since its initial commit, Apache Accumulo has shipped a number of examples to jumpstart comprehension of how some of these bits function, and to help tease out how they might be applied to a NoSQL-friendly use case. One very relatable example demonstrates how Accumulo can be used to emulate a filesystem (dirlist).
In this session we will walk through the dirlist implementation. Attendees should come away with an understanding of the supporting table designs, a simple text search supporting a single wildcard (on file/directory names), and how the dirlist elements work together to deliver its feature set. Attendees should (hopefully) also come away with a justification for sometimes keeping the SQL out of NoSQL.
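If you want to experiment before the session, here is a library-free Python toy of the core trick: sorted forward and reversed indexes turn a single trailing or leading wildcard into a prefix scan, which is exactly the access pattern Accumulo ranges provide. The names and layout are simplified, not the actual dirlist code.

    import bisect

    files = ["alpha.txt", "alphabet.doc", "beta.txt", "report.txt"]

    forward = sorted(files)                    # serves queries like "alpha*"
    reverse = sorted(f[::-1] for f in files)   # serves queries like "*.txt"

    def prefix_scan(index, prefix):
        """All entries in a sorted index starting with prefix: the same
        access pattern as scanning an Accumulo Range over a table."""
        lo = bisect.bisect_left(index, prefix)
        hi = bisect.bisect_right(index, prefix + "\xff")
        return index[lo:hi]

    # Trailing wildcard: prefix scan on the forward index.
    print(prefix_scan(forward, "alpha"))
    # Leading wildcard: reverse the suffix, scan the reversed index,
    # then un-reverse the hits.
    print([r[::-1] for r in prefix_scan(reverse, ".txt"[::-1])])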
HBase Global Indexing to support large-scale data ingestion at UberDataWorks Summit
Data is the foundation of decision-making at Uber. To facilitate data-driven decisions, many datasets at Uber are ingested into a Hadoop data lake and exposed for querying via Hive. Analytical queries joining various datasets are run to better understand business data at Uber.
Data ingestion, in its most basic form, is about organizing data to balance efficient reading and writing of newer data. Organizing for efficient reads means factoring in query patterns when partitioning data, to keep read amplification low. Organizing for efficient writes means factoring in the nature of the input data: whether it is append-only or updatable.
At Uber we ingest terabytes of data for many critical tables, such as trips, that are updatable. These tables are a fundamental part of Uber's data-driven solutions and act as the source of truth for all analytical use cases across the entire company. Datasets such as trips constantly receive updates as well as inserts. To ingest such datasets we need a component that is responsible for bookkeeping the data layout and that annotates each incoming change with the location in HDFS where the data should be written. This component is called Global Indexing. Without it, all records are treated as inserts and re-written to HDFS instead of being updated, duplicating data and breaking both data correctness and user queries. The component requires strong consistency and must sustain high throughput for index writes and reads.
At Uber, we chose HBase as the backing store for the Global Indexing component, which is critical to scaling our ingestion jobs to more than 500 billion writes a day. In this talk, we will discuss data at Uber, expound on why we built the global index on Apache HBase, and explain how it helps scale out our cluster usage. We'll detail why we chose HBase over other storage systems; how and why we devised a creative solution to load HFiles directly into the backend, circumventing the normal write path, when bootstrapping ingestion tables to avoid QPS constraints; and other lessons learned bringing this system to production at the scale of data Uber encounters daily.
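To illustrate the bookkeeping role such an index plays, here is a minimal sketch against HBase using the happybase client; the table layout and names are hypothetical, not Uber's actual schema.

    import happybase

    # Connect via the HBase Thrift gateway (host is a placeholder).
    conn = happybase.Connection('hbase-thrift-host')
    index = conn.table('record_index')  # hypothetical index table

    def locate(record_key):
        """Return the HDFS file currently holding this record, or None
        if the key is unknown (i.e., the change is a fresh insert)."""
        row = index.row(record_key, columns=[b'loc:file'])
        return row.get(b'loc:file')

    def record_location(record_key, hdfs_file):
        """Bookkeeping write once the record has landed in a file."""
        index.put(record_key, {b'loc:file': hdfs_file})

    # An incoming change is an update iff the index already knows its key.
    key = b'trip:9f3a17'
    print('update' if locate(key) else 'insert')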
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixDataWorks Summit
Recently, Apache Phoenix has been integrated with the Apache Omid (incubating) transaction processing service to provide high system throughput with very low latency overhead. Phoenix has been shown to scale beyond 0.5M transactions per second with sub-5ms latency for short transactions on industry-standard hardware. Omid, in turn, has been extended to support secondary indexes, multi-snapshot SQL queries, and massive-write transactions.
These innovative features make Phoenix an excellent choice for translytics applications, which allow converged transaction processing and analytics. We share the story of building the next-gen data tier for advertising platforms at Verizon Media that exploits Phoenix and Omid to support multi-feed real-time ingestion and AI pipelines in one place, and discuss the lessons learned.
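For a sense of what a short Omid-backed transaction looks like from the SQL surface, here is a sketch using the phoenixdb Python adapter; the table is hypothetical and assumes a cluster whose Phoenix transaction provider is configured to be Omid.

    import phoenixdb

    conn = phoenixdb.connect('http://phoenix-queryserver:8765/', autocommit=False)
    cur = conn.cursor()

    # TRANSACTIONAL=true routes this table's writes through the cluster's
    # transaction manager (Omid, when so configured).
    cur.execute(
        "CREATE TABLE IF NOT EXISTS AD_EVENTS ("
        " EVENT_ID VARCHAR PRIMARY KEY, CAMPAIGN VARCHAR, SPEND DECIMAL"
        ") TRANSACTIONAL=true")

    # Both upserts become visible atomically at commit, or not at all.
    cur.execute("UPSERT INTO AD_EVENTS VALUES ('e1', 'spring_sale', 0.25)")
    cur.execute("UPSERT INTO AD_EVENTS VALUES ('e2', 'spring_sale', 0.40)")
    conn.commit()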
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiDataWorks Summit
Cybersecurity requires an organization to collect data, analyze it, and alert on cyber anomalies in near real time. This is a challenging endeavor given the variety of data sources that need to be collected and analyzed: application logs, network events, authentication systems, IoT devices, business events, cloud service logs, and more. In addition, multiple data formats need to be transformed and conformed so they can be understood by both humans and ML/AI algorithms.
To solve this problem, the Aetna Global Security team developed the Unified Data Platform based on Apache NiFi, which allows them to remain agile and adapt to new security threats and the onboarding of new technologies in the Aetna environment. The platform currently has over 60 different data flows with 95% doing real-time ETL and handles over 20 billion events per day. In this session learn from Aetna’s experience building an edge to AI high-speed data pipeline with Apache NiFi.
In the healthcare sector, data security, governance, and quality are crucial for maintaining patient privacy and ensuring the highest standards of care. At Florida Blue, the leading health insurer of Florida serving over five million members, there is a multifaceted network of care providers, business users, sales agents, and other divisions relying on the same datasets to derive critical information for multiple applications across the enterprise. However, maintaining consistent data governance and security for protected health information and other extended data attributes has always been a complex challenge that did not easily accommodate the wide range of needs for Florida Blue’s many business units. Using Apache Ranger, we developed a federated Identity & Access Management (IAM) approach that allows each tenant to have their own IAM mechanism. All user groups and roles are propagated across the federation in order to determine users’ data entitlement and access authorization; this applies to all stages of the system, from the broadest tenant levels down to specific data rows and columns. We also enabled audit attributes to ensure data quality by documenting data sources, reasons for data collection, date and time of data collection, and more. In this discussion, we will outline our implementation approach, review the results, and highlight our “lessons learned.”
Presto: Optimizing Performance of SQL-on-Anything EngineDataWorks Summit
Presto, an open source distributed SQL engine, is widely recognized for its low-latency queries, high concurrency, and native ability to query multiple data sources. Proven at scale in a variety of use cases at Airbnb, Bloomberg, Comcast, Facebook, FINRA, LinkedIn, Lyft, Netflix, Twitter, and Uber, Presto has in the last few years experienced unprecedented growth in popularity in both on-premises and cloud deployments over object stores, HDFS, NoSQL, and RDBMS data stores.
With the ever-growing list of connectors to new data sources such as Azure Blob Storage, Elasticsearch, Netflix Iceberg, Apache Kudu, and Apache Pulsar, the recently introduced cost-based optimizer in Presto must account for heterogeneous inputs with differing and often incomplete data statistics. This talk will explore this topic in detail, discuss the best use cases for Presto across several industries, and present recent Presto advancements such as geospatial analytics at scale and the project roadmap going forward.
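As a flavor of the SQL-on-anything surface, here is a sketch using the presto-python-client package to join tables from two catalogs in a single query; the host, catalog, and table names are hypothetical.

    import prestodb

    conn = prestodb.dbapi.connect(
        host='presto-coordinator', port=8080,
        user='analyst', catalog='hive', schema='default',
    )
    cur = conn.cursor()

    # One query spanning a Hive table and a PostgreSQL table; the CBO needs
    # statistics from both connectors to pick a join strategy here.
    cur.execute("""
        SELECT o.region, count(*) AS orders
        FROM hive.web.orders o
        JOIN postgresql.public.customers c ON o.customer_id = c.id
        GROUP BY o.region
    """)
    print(cur.fetchall())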
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...DataWorks Summit
Specialized tools for machine learning development and model governance are becoming essential. MLflow is an open source platform for managing the machine learning lifecycle. Just by adding a few lines of code to the function or script that trains their model, data scientists can log parameters, metrics, artifacts (plots, miscellaneous files, etc.) and a deployable packaging of the ML model. Every time that function or script is run, the results are logged automatically as a byproduct of those added lines, even if the person running the training makes no special effort to record them. MLflow application programming interfaces (APIs) are available for the Python, R, and Java programming languages, and MLflow sports a language-agnostic REST API as well. In a relatively short time, MLflow has garnered more than 3,300 stars on GitHub, almost 500,000 monthly downloads, and 80 contributors from more than 40 companies. Most significantly, more than 200 companies are now using MLflow. We will demo the MLflow Tracking, Projects, and Models components with Azure Machine Learning (AML) Services and show how easy it is to get started with MLflow on-prem or in the cloud.
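Those "few lines of code" look like this in MLflow's Python API; the tracking server URI, experiment name, and values are placeholders.

    import mlflow

    mlflow.set_tracking_uri('http://mlflow-server:5000')  # placeholder URI
    mlflow.set_experiment('churn-model')

    with mlflow.start_run():
        mlflow.log_param('n_estimators', 200)   # a training parameter
        mlflow.log_metric('auc', 0.91)          # an evaluation metric
        mlflow.log_artifact('roc_curve.png')    # any local file becomes a run artifact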
Extending Twitter's Data Platform to Google CloudDataWorks Summit
Twitter's Data Platform is built from multiple complex open source and in-house projects to support data analytics on hundreds of petabytes of data. The platform supports storage, compute, data ingestion, discovery, and management, plus various tools and libraries that help users with both batch and real-time analytics. It operates multiple clusters across different data centers to help thousands of users discover valuable insights. As we scaled our Data Platform to multiple clusters, we also evaluated various cloud vendors to support use cases outside our data centers. In this talk we share our architecture and how we extended our data platform to use the cloud as another data center. We walk through our evaluation process, the challenges we faced supporting data analytics at Twitter scale in the cloud, and our current solution. Extending Twitter's Data Platform to the cloud was a complex task, which we examine in depth in this presentation.
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiDataWorks Summit
At Comcast, our team has been architecting a customer experience platform which is able to react to near-real-time events and interactions and deliver appropriate and timely communications to customers. By combining the low latency capabilities of Apache Flink and the dataflow capabilities of Apache NiFi we are able to process events at high volume to trigger, enrich, filter, and act/communicate to enhance customer experiences. Apache Flink and Apache NiFi complement each other with their strengths in event streaming and correlation, state management, command-and-control, parallelism, development methodology, and interoperability with surrounding technologies. We will trace our journey from starting with Apache NiFi over three years ago and our more recent introduction of Apache Flink into our platform stack to handle more complex scenarios. In this presentation we will compare and contrast which business and technical use cases are best suited to which platform and explore different ways to integrate the two platforms into a single solution.
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerDataWorks Summit
Companies are increasingly moving to the cloud to store and process data. One challenge is securing data across hybrid environments while managing policies centrally and easily. In this session, we will discuss how companies can use Apache Ranger to protect access to data in on-premises as well as cloud environments. We will go into the details of the challenges of hybrid environments and how Ranger solves them. We will also cover how companies can further enhance security by using Ranger to anonymize or tokenize data while moving it into the cloud, and to de-anonymize it dynamically using Apache Hive, Apache Spark, or when accessing data from cloud storage systems. We will deep dive into Ranger's integration with AWS S3, AWS Redshift, and other cloud-native systems, and wrap up with an end-to-end demo showing how policies can be created in Ranger and used to manage access to data in different systems, anonymize or de-anonymize data, and track where data is flowing.
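As a sketch of the policy-as-API workflow this kind of demo builds on, here is a call to Ranger's public REST endpoint for creating a policy; the service, resource, and credential values are placeholders, and the field names follow Ranger's v2 policy model.

    import requests

    policy = {
        'service': 'cm_hive',                 # hypothetical Ranger service name
        'name': 'analysts_read_sales',
        'resources': {
            'database': {'values': ['sales']},
            'table':    {'values': ['orders']},
            'column':   {'values': ['*']},
        },
        'policyItems': [{
            'groups': ['analysts'],
            'accesses': [{'type': 'select', 'isAllowed': True}],
        }],
    }

    resp = requests.post(
        'https://ranger-admin:6182/service/public/v2/api/policy',
        json=policy,
        auth=('admin', 'admin'),  # placeholder credentials
        verify=False,             # demo only; verify TLS in real deployments
    )
    resp.raise_for_status()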
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...DataWorks Summit
Advanced big data processing frameworks have been proposed to harness the fast data transmission capability of Remote Direct Memory Access (RDMA) over high-speed networks such as InfiniBand, RoCE v1, RoCE v2, iWARP, and Omni-Path. However, with the introduction of Non-Volatile Memory (NVM) and NVM Express (NVMe) SSDs, these designs, along with the default big data processing models, need to be reassessed to discover the possibilities of further enhanced performance. In this talk, we will present NRCIO, a high-performance communication runtime for non-volatile memory over modern network interconnects that can be leveraged by existing big data processing middleware. We will show the performance of non-volatile memory-aware RDMA communication protocols using our proposed runtime and demonstrate its benefits by incorporating it into a high-performance in-memory key-value store, Apache Hadoop, Tez, Spark, and TensorFlow. Evaluation results show that NRCIO can achieve up to 3.65x performance improvement for representative big data processing workloads on modern data centers.
Background: Some early applications of Computer Vision in Retail arose from e-commerce use cases - but increasingly, it is being used in physical stores in a variety of new and exciting ways, such as:
● Optimizing merchandising execution, in-stocks and sell-thru
● Enhancing operational efficiencies and enabling real-time customer engagement
● Enhancing loss prevention capabilities and response times
● Creating frictionless experiences for shoppers
Abstract: This talk will cover the use of Computer Vision in Retail, the implications to the broader Consumer Goods industry and share business drivers, use cases and benefits that are unfolding as an integral component in the remaking of an age-old industry.
We will also take a ‘peek under the hood’ of Computer Vision and Deep Learning, sharing technology design principles and skill set profiles to consider before starting your CV journey.
Deep learning has matured considerably in the past few years to produce human or superhuman abilities in a variety of computer vision paradigms. We will discuss ways to recognize these paradigms in retail settings, collect and organize data to create actionable outcomes with the new insights and applications that deep learning enables.
We will cover the basics of object detection, then move into advanced image processing, describing possible ways a retail store of the near future could operate: a deep learning system attached to a camera stream identifying various storefront situations, such as item stock levels on shelves, a shelf in need of organization, or a wandering customer in need of assistance.
We will also cover how to use a computer vision system to automatically track customer purchases to enable a streamlined checkout process, and how deep learning can power plausible wardrobe suggestions based on what a customer is currently wearing or purchasing.
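To make the shared building block behind these scenarios concrete, here is a minimal sketch of running an off-the-shelf detector on one camera frame using recent torchvision; the model choice and frame source are illustrative, not the talk's production stack.

    import torch
    import torchvision
    from torchvision.transforms.functional import to_tensor
    from PIL import Image

    # Off-the-shelf detector; a production system would use a model tuned
    # for retail classes (shelf gaps, products, people).
    model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights='DEFAULT')
    model.eval()

    # One frame grabbed from a store camera (file name is a placeholder).
    frame = Image.open('shelf_camera_frame.jpg').convert('RGB')

    with torch.no_grad():
        detections = model([to_tensor(frame)])[0]

    # Keep confident detections; mapping them to shelf/stock/assistance
    # logic is where the application-specific work lives.
    for box, label, score in zip(detections['boxes'], detections['labels'],
                                 detections['scores']):
        if score > 0.8:
            print(int(label), float(score), [round(v, 1) for v in box.tolist()])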
Finally, we will cover the various technologies powering these applications today: deep learning tools for research and development; production tools to distribute that intelligence to the entire inventory of cameras situated around a retail location; and tools for exploring and understanding the new data streams produced by the computer vision systems.
By the end of this talk, attendees should understand the impact Computer Vision and Deep Learning are having in the Consumer Goods industry, key use cases, techniques and key considerations leaders are exploring and implementing today.
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkDataWorks Summit
Whole-genome shotgun-based next-generation transcriptomics and metagenomics studies often generate 100 to 1000 gigabytes (GB) of sequence data derived from tens of thousands of different genes or microbial species. De novo assembly of these data requires a solution that both scales with data size and optimizes for individual genes or genomes. Here we developed an Apache Spark-based scalable sequence clustering application, SparkReadClust (SpaRC), that partitions reads by their molecule of origin to enable downstream assembly optimization. SpaRC produces high clustering performance on transcriptomics and metagenomics test datasets from both short-read and long-read sequencing technologies. It achieves near-linear scalability with respect to input data size and the number of compute nodes, and it runs on different cloud computing environments without modification while delivering similar performance. In summary, our results suggest SpaRC provides a scalable solution for clustering billions of reads from next-generation sequencing experiments, and that Apache Spark is a cost-effective solution with rapid development/deployment cycles for similar big data genomics problems.
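The Spark shape of the idea can be sketched in a few lines of PySpark: link reads that share k-mers so reads from the same molecule cluster together. SpaRC's actual algorithm is considerably more involved; this toy only shows the partitioning primitive.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName('read-clustering-sketch').getOrCreate()
    sc = spark.sparkContext

    K = 5  # toy k-mer size; real pipelines use much larger k
    reads = sc.parallelize([
        ('read1', 'ACGTACGTAC'),
        ('read2', 'CGTACGTACG'),
        ('read3', 'TTTTGGGGCC'),
    ])

    def kmers(read_id, seq):
        return [(seq[i:i + K], read_id) for i in range(len(seq) - K + 1)]

    # k-mer -> set of reads containing it; shared k-mers are the edges
    # that join reads into clusters.
    edges = (reads.flatMap(lambda r: kmers(*r))
                  .groupByKey()
                  .mapValues(set)
                  .filter(lambda kv: len(kv[1]) > 1))

    print(edges.take(5))
    spark.stop()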
Generating a custom Ruby SDK for your web service or Rails API using Smithyg2nightmarescribd
Have you ever wanted a Ruby client API to communicate with your web service? Smithy is a protocol-agnostic language for defining services and SDKs. Smithy Ruby is an implementation of Smithy that generates a Ruby SDK using a Smithy model. In this talk, we will explore Smithy and Smithy Ruby to learn how to generate custom feature-rich SDKs that can communicate with any web service, such as a Rails JSON API.
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf91mobiles
91mobiles recently conducted a Smart TV Buyer Insights Survey in which we asked over 3,000 respondents about the TV they own, aspects they look at on a new TV, and their TV buying preferences.
Elevating Tactical DDD Patterns Through Object CalisthenicsDorra BARTAGUIZ
After immersing yourself in the blue book and its red counterpart, attending DDD-focused conferences, and applying tactical patterns, you're left with a crucial question: How do I ensure my design is effective? Tactical patterns within Domain-Driven Design (DDD) serve as guiding principles for creating clear and manageable domain models. However, achieving success with these patterns requires additional guidance. Interestingly, we've observed that a set of constraints initially designed for training purposes remarkably aligns with effective pattern implementation, offering a more ‘mechanical’ approach. Let's explore together how Object Calisthenics can elevate the design of your tactical DDD patterns, offering concrete help for those venturing into DDD for the first time!
GraphRAG is All You need? LLM & Knowledge GraphGuy Korland
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Tobias Schneck
As AI technology pushes into IT, I found myself wondering, as an "infrastructure container Kubernetes guy", how this fancy AI technology gets managed from an infrastructure and operations point of view. Is it possible to apply our beloved cloud-native principles as well? What benefits could the two technologies bring to each other?
Let me take these questions and guide you on a short journey through existing deployment models and use cases for AI software. Using practical examples, we will discuss what cloud/on-premise strategy we may need to make it work on our own infrastructure from an enterprise perspective. I will give an overview of infrastructure requirements and technologies, and of what could benefit or limit your AI use cases in an enterprise environment. An interactive demo will give you some insights into the approaches I already have working for real.
Epistemic Interaction - tuning interfaces to provide information for AI supportAlan Dix
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...DanBrown980551
Do you want to learn how to model and simulate an electrical network from scratch in under an hour?
Then welcome to this PowSyBl workshop, hosted by Rte, the French Transmission System Operator (TSO)!
During the webinar, you will discover the PowSyBl ecosystem as well as handle and study an electrical network through an interactive Python notebook.
PowSyBl is an open source project hosted by LF Energy, which offers a comprehensive set of features for electrical grid modelling and simulation. Among other advanced features, PowSyBl provides:
- A fully editable and extendable library for grid component modelling;
- Visualization tools to display your network;
- Grid simulation tools, such as power flows, security analyses (with or without remedial actions) and sensitivity analyses.
The framework is mostly written in Java, with a Python binding so that Python developers can access PowSyBl functionalities as well.
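For a taste of that binding, here is a short sketch assuming the pypowsybl package; treat the exact function names as illustrative of its documented API rather than a guaranteed snippet.

    import pypowsybl as pp

    network = pp.network.create_ieee14()   # bundled IEEE 14-bus test network
    results = pp.loadflow.run_ac(network)  # run an AC power flow
    print(results[0].status)               # convergence status of the main component
    print(network.get_buses()[['v_mag', 'v_angle']].head())  # solved bus states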
What you will learn during the webinar:
- For beginners: discover PowSyBl's functionalities through a quick general presentation and the notebook, without needing any expert coding skills;
- For advanced developers: master the skills to efficiently apply PowSyBl functionalities to your real-world scenarios.
Connector Corner: Automate dynamic content and events by pushing a buttonDianaGray10
Here is something new! In our next Connector Corner webinar, we will demonstrate how you can use a single workflow to:
Create a campaign using Mailchimp with merge tags/fields
Send an interactive Slack channel message (using buttons)
Have the message received by managers and peers along with a test email for review
But there’s more:
In a second workflow supporting the same use case, you’ll see:
Your campaign sent to target colleagues for approval
If the “Approve” button is clicked, a Jira/Zendesk ticket is created for the marketing design team
But—if the “Reject” button is pushed, colleagues will be alerted via Slack message
Join us to learn more about this new, human-in-the-loop capability, brought to you by Integration Service connectors.
And...
Speakers:
Akshay Agnihotri, Product Manager
Charlie Greenberg, Host
DevOps and Testing slides at DASA ConnectKari Kakkonen
Slides by me and Rik Marselis from the DASA Connect conference on 30.5.2024. We discuss what testing is, then what agile testing is, and finally what testing in DevOps looks like. We closed with a lovely workshop in which participants explored different ways to think about quality and testing in different parts of the DevOps infinity loop.
Transcript: Selling digital books in 2024: Insights from industry leaders - T...BookNet Canada
The publishing industry has been selling digital audiobooks and ebooks for over a decade and has found its groove. What’s changed? What has stayed the same? Where do we go from here? Join a group of leading sales peers from across the industry for a conversation about the lessons learned since the popularization of digital books, best practices, digital book supply chain management, and more.
Link to video recording: https://bnctechforum.ca/sessions/selling-digital-books-in-2024-insights-from-industry-leaders/
Presented by BookNet Canada on May 28, 2024, with support from the Department of Canadian Heritage.
Essentials of Automations: Optimizing FME Workflows with ParametersSafe Software
Are you looking to streamline your workflows and boost your projects’ efficiency? Do you find yourself searching for ways to add flexibility and control over your FME workflows? If so, you’re in the right place.
Join us for an insightful dive into the world of FME parameters, a critical element in optimizing workflow efficiency. This webinar marks the beginning of our three-part “Essentials of Automation” series. This first webinar is designed to equip you with the knowledge and skills to utilize parameters effectively: enhancing the flexibility, maintainability, and user control of your FME projects.
Here’s what you’ll gain:
- Essentials of FME Parameters: Understand the pivotal role of parameters, including Reader/Writer, Transformer, User, and FME Flow categories. Discover how they are the key to unlocking automation and optimization within your workflows.
- Practical Applications in FME Form: Delve into key user parameter types including choice, connections, and file URLs. Allow users to control how a workflow runs, making your workflows more reusable. Learn to import values and deliver the best user experience for your workflows while enhancing accuracy.
- Optimization Strategies in FME Flow: Explore the creation and strategic deployment of parameters in FME Flow, including the use of deployment and geometry parameters, to maximize workflow efficiency.
- Pro Tips for Success: Gain insights on parameterizing connections and leveraging new features like Conditional Visibility for clarity and simplicity.
We’ll wrap up with a glimpse into future webinars, followed by a Q&A session to address your specific questions surrounding this topic.
Don’t miss this opportunity to elevate your FME expertise and drive your projects to new heights of efficiency.
To Graph or Not to Graph: Knowledge Graph Architectures and LLMs
Hadoop and the Data Warehouse: When to Use Which
1. HADOOP & THE DATA WAREHOUSE: WHEN TO USE WHICH
Steve Wooledge – Teradata Labs
Jim Walker – Hortonworks
2. Topics
• Trends in enterprise data architectures
• The value of an integrated data warehouse
• The value of Hadoop
• Bringing it all together and next steps
3. Big Data Comes with BIG HEADACHES
"Even free software like Hadoop is causing companies to spend more money… Many CIOs believe data is inexpensive because storage has become inexpensive. But data is inherently messy—it can be wrong, it can be duplicative, and it can be irrelevant—which means it requires handling, which is where the real expenses come in."
Source: The Wall Street Journal, "CIOs' Big Problem with Big Data", Aug 2012
"Through 2015, 85% of Fortune 500 organizations will be unable to exploit big data for competitive advantage."
Source: Gartner, "Information Innovation: Innovation Key Initiative Overview", April 2012
4. Organizations Face Several Obstacles with Big Data
• Difficulty managing multiple systems, new types of data
• Hard to find the right skills; lack of supportability for new systems & "data scientists"
• Difficulty deploying and integrating new systems
• Difficulty providing accessibility to fast insights on big data
Source: Big Analytics 2012 Survey, Teradata
5. Shift from a Single Platform to an Ecosystem
"Big Data requirements are solved by a range of platforms including analytical databases, discovery platforms, and NoSQL solutions beyond Hadoop."
"We will abandon the old models based on the desire to implement for high-value analytic applications."
"Logical" Data Warehouse
Source: "Big Data Comes of Age". EMA and 9sight Consulting. Nov 2012.
6. TERADATA UNIFIED DATA ARCHITECTURE
[Diagram: data sources (audio & video, images, text, web & social, machine logs, CRM, SCM, ERP) feed a Discovery Platform and an Integrated Data Warehouse (capture | store | refine), consumed through languages, math & stats, data mining, business intelligence, and applications by engineers, data scientists, business analysts, marketing, front-line workers, customers/partners, operational systems, and executives.]
7. Topics
• Trends in enterprise data architectures
• The value of an integrated data warehouse
• The value of Hadoop
• Bringing it all together and next steps
8. The Value of The Data Warehouse
[Diagram: an Integrated Data Warehouse consolidating independent data marts, dual systems, an analytical archive, test/dev, and a data lab; serving business analysts, knowledge workers, marketing, executives, front-line workers, customers/partners, and operational systems through business intelligence, data mining, and applications. Integrated analytics span advanced analytics, temporal, OLAP, optimization, geospatial, big data integration, application development, agile analytics, and data exploration.]
Benefits:
• Easy to consume data
• Rationalization of data from multiple sources into a single enterprise view
• Clean, safe, secure data
• Cross-functional analysis
• Transform once, use many
• Fast response times
9. SQL Advantages with an MPP RDBMS
• Full ANSI SQL:
  - The lingua franca of business users when accessing data
  - Decades of standardization (stable, feature-rich, portable)
• Mature third-party SQL-based tools that give business users self-service, direct access to the data:
  - BI tools
  - In-database statistical packages
  - Analytic applications (CRM, SCM, MDM)
• Easily parallelized
• Scalable when manipulating large data sets
10. ACID Advantages in an MPP RDBMS
• Guarantees database actions are processed reliably
• Ensures 100% query result accuracy
• Supports updates and deletes
• Needed for applications that require 100% consistency
Atomicity - All of the pieces are committed or none are committed.
Consistency - Creates a new and valid state of data, or, if any failure occurs, returns all data to its original state.
Isolation - Processed and not-yet-committed transactions must remain isolated from any other transactions.
Durability - Committed data is saved such that, in the event of a failure and system restart, the data is available in its correct state.
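A compact, vendor-neutral illustration of atomicity in Python, using SQLite as a stand-in for any ACID-compliant RDBMS: a failed transfer rolls back both legs, so the total balance is never wrong.

    import sqlite3

    conn = sqlite3.connect(':memory:')
    conn.execute('CREATE TABLE accounts (id TEXT PRIMARY KEY, balance INTEGER)')
    conn.executemany('INSERT INTO accounts VALUES (?, ?)', [('a', 100), ('b', 0)])
    conn.commit()

    try:
        with conn:  # one transaction: commits on success, rolls back on error
            conn.execute("UPDATE accounts SET balance = balance - 150 WHERE id = 'a'")
            # Simulate a business-rule failure mid-transaction.
            overdrawn = conn.execute(
                "SELECT balance < 0 FROM accounts WHERE id = 'a'").fetchone()[0]
            if overdrawn:
                raise ValueError('insufficient funds')
            conn.execute("UPDATE accounts SET balance = balance + 150 WHERE id = 'b'")
    except ValueError:
        pass

    # Both rows are untouched: the partial update was rolled back atomically.
    print(conn.execute('SELECT * FROM accounts ORDER BY id').fetchall())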
11. Tight Vertical Integration
• End-to-end management of resources
• Efficient utilization of resources
• Engineered extremely well for known data
• Fine-grained parallelism and resource management
• Consistency of service level delivery
Best Practices Management:
• Workload functions
• Workload groups
• Exceptions
• Priorities
• Time periods
12. Low Latency Advantages of MPP RDBMS
Multi-temperature storage with automated distribution of data based on access patterns:
• In-memory
• Solid-state drives
• Fast hard drives
• Fat hard drives
Complemented by indexes, statistics, and advanced partitioning.
13. Cost Based Optimizer Advantages in an MPP RDBMS
• A best-practices optimizer determines how the query will be processed most efficiently, with no "hints" or degrees of parallelism necessary.
• In chess, you can look a few moves ahead to decide your best next move, but you can't envision every move-and-countermove sequence for the entire game:
  - The Grand Master has the knowledge, experience, and intelligence to identify and use the right strategy.
  - With Hadoop, the user takes a heavy role in optimizing the execution of queries.
  - With an MPP RDBMS, the software is the optimizer.
There are many ways to process a complex query:
• Query rewrite: semantic optimization; different types of vendor tools
• Fast/efficient data access: access path (indexing); partitioning (CP & PPI); advanced partitioning schemes (range- and case-based, multilevel, dynamic); I/O optimizations (efficient scans/sync scan)
• Query complexity: join costing & planning; aggregation
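A small, vendor-neutral illustration of "the software is the optimizer": SQLite's planner below switches from a full scan to an index search with no user hints, the same principle an MPP optimizer applies at far greater scale.

    import sqlite3

    conn = sqlite3.connect(':memory:')
    conn.execute('CREATE TABLE sales (region TEXT, amount INTEGER)')

    query = "SELECT sum(amount) FROM sales WHERE region = 'west'"
    # Without an index, the planner scans the table.
    print(conn.execute('EXPLAIN QUERY PLAN ' + query).fetchall())

    conn.execute('CREATE INDEX idx_region ON sales (region)')
    # Same declarative query; the planner now searches via the index.
    print(conn.execute('EXPLAIN QUERY PLAN ' + query).fetchall())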
14. Granular Security Advantages in an MPP RDBMS
• Row-level security
• Column-level security
• An MPP RDBMS tightly integrates mature security features:
  - User-level security controls
  - Increased user authentication options
  - Support for security roles
  - Enterprise directory integration
  - Auditing and monitoring controls
  - Encryption
16. Topics
• Trends in enterprise data architectures
• The value of an integrated data warehouse
• The value of Hadoop
• Bringing it all together and next steps
25. TERADATA UNIFIED DATA ARCHITECTURE
[Recap of the architecture diagram from slide 6.]
28. Organizations Face Several Obstacles with Big Data
[Recap of the four obstacles from slide 4. Source: Big Analytics 2012 Survey, Teradata]
29. Topics
• Trends in enterprise data architectures
• The value of an integrated data warehouse
• The value of Hadoop
• Bringing it all together and next steps
39. Teradata Portfolio for Hadoop
"Taking Hadoop from Silicon Valley to Main Street"
What we announced today:
• The most trusted & flexible Hadoop platforms for your next-generation Unified Data Architecture™:
  1. Teradata Aster Big Analytics Appliance
  2. Teradata Appliance for Hadoop
  3. Teradata Commodity Offering for Hadoop (Dell)
  4. Teradata Software-only for Hadoop (Hortonworks Data Platform)
• Complete consulting and training capability:
  - Big Analytics Services – across the UDA
  - Data Integration Optimization – ETL, ELT across the UDA
  - Hadoop deployment & mentoring
  - Teradata delivering Hortonworks training
  - Hadoop Managed Services – operations & administration
• Customer support for Hadoop:
  - World-class Teradata customer support, backed by Hortonworks
40. Teradata Appliance for Hadoop
Value-added software bringing Hadoop to the enterprise:
• Access: SQL-H™ (via HCatalog)
• Management: Viewpoint, TVI
• Administration: Hadoop Builder, intelligent start/stop, DataNode swap, deferred drive replacement
• High availability: NameNode HA, master machine failover
• Refining, metadata, entity resolution
• Security & data access: Kerberos
41. Complete Consulting and Training Capability
Post-sale services and areas of focus:
• Teradata Analytic Architecture Services – services to scope, design, build, operate and maintain an optimal UDA approach for Teradata, Aster, and Hadoop
• Teradata DI Optimization – assess structured/non-structured data, discuss data loading techniques, determine the best platform, optimize load scripts/processes
• Teradata Big Analytics – assess data value/cost of capture, identify sources of "exhaust" data, create a conceptual architecture, refine and enrich the data, implement initial analytics in Aster or the best-fit tool
• Teradata Workshop for Hadoop – introductory workshop (across all of the UDA)
• Teradata Data Staging for Hadoop – load data into a landing area; set up a data exploration/refining area; scope architecture and analytics; set up the Hadoop repository; load sample data
• Teradata Platform for Hadoop – installation guidance and mentoring for the Hadoop platform, do-it-yourself after installation
• Teradata Managed Services for Hadoop – operations, management, administration, backup, security, process control for Hadoop
• Teradata Training Courses for Hadoop – two comprehensive, multi-day training offerings: 1) Administration of Apache Hadoop and 2) Developing Solutions Using Apache Hadoop
42. When to Use Which?
The best approach by workload and data type: processing as a function of schema requirements and stage of the data pipeline.
Pipeline stages (columns): low-cost storage and fast loading | data pre-processing, refining, cleansing | "simple math at scale" (score, filter, sort, avg., count...) | joins, unions, aggregates | analytics (iterative and data mining) | reporting
• Stable schema: Teradata/Hadoop | Teradata | Teradata | Teradata | Teradata | Teradata
• Evolving schema: Hadoop | Aster/Hadoop | Aster/Hadoop | Aster | Aster (SQL + MapReduce analytics) | Aster
• Format, no schema: Hadoop | Hadoop | Hadoop | Aster | Aster (MapReduce analytics) | Aster
Example workloads:
• Financial Analysis, Ad-Hoc/OLAP
• Enterprise-Wide BI and Reporting
• Spatial/Temporal
• Active Execution
• Interactive Data Discovery
• Web Clickstream, Set-Top Box Analysis
• CDRs, Sensor Logs, JSON
• Social Feeds, Text, Image Processing
• Audio/Video Storage and Refining
• Storage and Batch Transformations