First in Class: Optimizing the Data Lake for Tighter Integration

Grab some coffee and enjoy the pre-show banter before the top of the hour

The Briefing Room
First in Class: Optimizing the Data Lake for Tighter Integration

Twitter Tag: #briefr
Welcome
Host:
Eric Kavanagh
eric.kavanagh@bloorgroup.com
@eric_kavanagh

Mission
•  Reveal the essential characteristics of enterprise software, good and bad
•  Provide a forum for detailed analysis of today's innovative technologies
•  Give vendors a chance to explain their product to savvy analysts
•  Allow audience members to pose serious questions... and get answers!

Topics
October: DATA MANAGEMENT
November: ANALYTICS
December: INNOVATORS

What Goes In, Should Come Out
•  Well Begun = Half Done
•  Smart Architecture > Clever Queries
•  Low Cost for Planning < Optimal
•  Schema on Read ≠ Haphazard Ingestion

Analyst: Robin Bloor
Robin Bloor is Chief Analyst at The Bloor Group
robin.bloor@bloorgroup.com
@robinbloor

Teradata RainStor
•  Teradata RainStor is well known for its data archiving solutions
•  Its capabilities include an archive on Hadoop’s HDFS, which allows for SQL queries over the archive
•  When combined with Hadoop, Teradata RainStor can enable an optimized data lake capable of storing raw data and acting as an enterprise system of record

Guest: Mark Cusack
Mark joined Teradata in 2014 as part of its RainStor acquisition. As a founding developer and Chief Architect at RainStor, he has worked on many different aspects of the product since 2004. Most recently, he led the efforts to integrate RainStor with Hadoop and with Teradata. He was formerly a senior scientist and team lead at QinetiQ, where he researched distributed simulation techniques and developed physics-based models of human behavior to support military training and operations. He also led government and industry projects in the areas of grid and pervasive computing. Before joining QinetiQ, Mark worked in academia, where he combined cluster computing methods with quantum mechanics to predict the properties of semiconductor microstructures. Mark holds a Masters in Computing and a PhD in Physics from Newcastle University.

Teradata RainStor® for the Data Lake

© 2015 Teradata
Teradata RainStor® – The Structured Data Lake Foundation
Simple, Efficient, Scalable, Cost-Effective
Teradata RainStor is the most efficient, scalable and accessible way to store structured or semi-structured data in your data lake
•  Cost Savings
–  Convert CapEx to OpEx
–  Decrease storage footprint
–  Future-proof capacity
•  Fast Flexible Access
–  Standards based
–  Compression optimizes queries
•  Data Governance
–  Privacy
–  Security
–  Integrity
INGEST: At Network Speed
COMPRESS: 50-80% Cluster Reduction
ANALYZE: 10-100% Performance Boost
[Diagram: RainStor Partitions, HDFS Files...]

Data Lake Use Cases Applicable to RainStor
•  System of Record for Structured Data
–  Provide a trusted source of data with tracking
–  Meet commercial and regulatory requirements
•  Archive for Structured Data
–  Offload historical data
–  Central control of restore capabilities
•  Discovery
–  Data profiling to discover correlations
•  Analysis
–  Custom analytics, signal analysis, and event reporting
•  ETL Remix
–  Staging platform for data cleanup prior to EDW analysis

Archive or a System of Record
Depends on the position of RainStor with respect to the source
[Diagram: two arrangements of Source, RainStor, and Warehouse or Database, one labeled Archive and the other labeled System of Record]

Teradata RainStor® Overview
LOAD: Billions Records/Day
COMPRESS: 10-40X (90%+)
QUERY: SQL, BI Tools, Hive, MapReduce
SCALE – Any Platform (MPP, Shared Everything)
AVAILABILITY: Replication
GOVERN: Rules Based
SECURE – Enterprise-Grade
MOVE
[Diagram labels: EDW/DB, Network, Tape, Hadoop, NAS, CAS, Apps]

Data Load
Challenges
–  Log, clickstream, and sensor data, tax import systems
–  Encryption is expensive
–  Extended wait times for data access
–  Maintaining data integrity
RainStor Solutions
–  MPP (scalable data load)
–  Encryption of compressed data
–  Data immediately available for query
–  Fingerprinted
How does RainStor do this?
–  Separately stages delimited text data
–  De-dupes and builds partitions stored on HDFS
–  Shows multi-node query process view of data across HDFS
[Diagram labels: Source, Staging Area, Data Collection Process, RainStor Node 1 to RainStor Node N, Service Manager, Load, Query, Fork, HDFS Data Node]
“Move your costs by a decimal point!”
~ Architect, Global Financial Services Company

Compression
Challenges
–  Storage costs outstrip budgets
–  Queries take longer than ever
RainStor Solutions
–  Patented compression techniques
–  In-memory and on-disk compression is a performance multiplier and storage saver
–  Stored in binary tree format
–  Algorithms that query compressed data
–  Hardware & bandwidth multiplier
–  2-10X more compressed than ORC
–  Cost saving on floor, cooling and personnel
–  Drives efficient query execution framework
–  CPU rather than IO bound
“Now we can keep years of history that wasn’t economically feasible until now.”
~ Architect, Communication Service Provider
[Figure: Stock Trades Example. Records 0001 to 0004 with field values 100, 200, $12, $13, AA, BAC]
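The slide does not spell out how the patented compression works, so the following is only a generic sketch of value-level de-duplication (dictionary encoding), not RainStor's actual algorithm. The trade records, the field layout, and the `dedupe_records` and `decode` helpers are all hypothetical and exist purely for illustration.

```python
from typing import Dict, List, Tuple


def dedupe_records(
    records: List[Tuple[str, ...]]
) -> Tuple[List[str], List[Tuple[int, ...]]]:
    """Store each distinct field value once and rewrite records as value IDs."""
    value_ids: Dict[str, int] = {}  # distinct value -> index into `values`
    values: List[str] = []          # every distinct value, stored exactly once
    encoded: List[Tuple[int, ...]] = []
    for record in records:
        row = []
        for field in record:
            if field not in value_ids:
                value_ids[field] = len(values)
                values.append(field)
            row.append(value_ids[field])
        encoded.append(tuple(row))
    return values, encoded


def decode(values: List[str], encoded: List[Tuple[int, ...]]) -> List[Tuple[str, ...]]:
    """Rebuild the original records from the value list and the ID rows."""
    return [tuple(values[i] for i in row) for row in encoded]


# Hypothetical stock trades: (symbol, quantity, price)
trades = [
    ("AA", "100", "$12"),
    ("BAC", "200", "$13"),
    ("AA", "200", "$12"),
    ("BAC", "100", "$13"),
]
values, encoded = dedupe_records(trades)
print(len(values), "distinct values for", sum(len(t) for t in trades), "fields")
```

In this toy input, twelve field occurrences collapse to six stored values, and the records themselves become small integer tuples. Real data with many repeated values would benefit more, which is the general intuition behind de-duplicating before storing.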

Query
Challenges
–  Query speed is #1 concern
–  Hive queries aren’t standards based
–  Rewriting queries is a huge task
–  Data transparency
RainStor Solutions
–  SQL access – 2-10x faster than Hive
–  Improved Hive performance
–  Efficient parallel query execution
–  User defined functions supported
–  Teradata connectivity via QueryGrid
–  BI Tool access
–  Query access via HCatalog
[Diagram labels: SQL, Hive, Pig, HCatalog; Predicates; Metadata (Static, Dynamic); Fields, Stats, Types; Bloom Filter; partitions (P) on HDFS]
“RainStor doesn’t care what hardware it runs on. It’s just as good on Tier-2 or Tier-3 hardware.”
~ Chief Architect, Global Investment Bank
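The diagram on this slide mentions predicates and a Bloom filter alongside partitions on HDFS. As a rough illustration of the general idea, and not a description of RainStor internals, the sketch below keeps one Bloom filter per partition and uses it to skip partitions that cannot contain a value named in an equality predicate. The partition names, contents, filter size, and hashing scheme are all assumptions made for the example.

```python
import hashlib
from typing import Dict, Iterable, List


class BloomFilter:
    """Small Bloom filter: may give false positives, never false negatives."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, value: str) -> Iterable[int]:
        # Derive several bit positions by salting one hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value: str) -> None:
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def might_contain(self, value: str) -> bool:
        return all((self.bits >> pos) & 1 for pos in self._positions(value))


def build_filter(values: Iterable[str]) -> BloomFilter:
    bloom = BloomFilter()
    for value in values:
        bloom.add(value)
    return bloom


# Hypothetical partitions, each holding the stock symbols it contains.
partitions: Dict[str, List[str]] = {
    "p1": ["AA", "BAC"],
    "p2": ["IBM", "TDC"],
    "p3": ["AA", "TDC"],
}
filters = {name: build_filter(symbols) for name, symbols in partitions.items()}


def candidate_partitions(symbol: str) -> List[str]:
    """Partitions that still need scanning for a predicate `symbol = <value>`."""
    return [name for name, bloom in filters.items() if bloom.might_contain(symbol)]


print(candidate_partitions("AA"))
```

Because a Bloom filter never reports a false negative, any partition that really holds the value is always kept; the saving comes from partitions the filter can rule out without reading them.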

RainStor Governance
•  Data Encryption
•  Data Masking
•  Log Masking
•  View-Based Dynamic Masking
•  Authentication
–  Kerberos
–  LDAP/AD
–  Linux PAM
•  SQL92 Authorization
•  Immutable Data Model
•  Record-Level Delete
•  Schema Evolution
•  Data Disposition
•  Replication
•  Audit Trail
[Diagram labels: Privacy, Security, Integrity]
Designed to support PCI-DSS, SEC 17a-4, etc.
“How did you guys get it right and others didn’t?”
~ Architect, U.S. Bank

RainStor 7 Architecture
[Architecture diagram. Column headings: Apache; Teradata RainStor®; Teradata IDW & Tools]
•  Access: Hive, Pig, Java, HCatalog, MapReduce / YARN, Teradata QueryGrid™
•  Interactive SQL: Oracle, SQLServer, SybaseIQ, Netezza extensions; ODBC, JDBC; User-Defined Functions
•  Load and movement: Data Loader, FastConnect™, FastForward™, Teradata BAR, Teradata IDW
•  Services: Management, Alerting, Security, Retention Rules, Replication, Compliance
•  Storage: RainStor Files on HDFS (CDH/HDS) or on NAS, CAS, SAN, WORM (vendor specific)

Integration with Teradata QueryGrid
TD QueryGrid Support for RainStor
[Diagram labels: Business users, Data scientists; Teradata Aster, RainStor on Hadoop, Teradata Database, Hadoop, Other Databases]

RainStor Integration with Teradata
[Diagram labels: RainStor; FastForward™; FastConnect™; QueryGrid™; BAR; PBs of history; PS Engagement]

US Telco Case Study #1: Data Lake
Network Performance
Problem
•  Storage & analysis of network events
–  Performance, faults, changes
Challenges
•  50TB raw data/day
•  Demanding query SLAs
Results
•  Storing 30 days data – up from 3 days
•  8 node Hadoop cluster
•  83% reduction in storage footprint
•  Data lake system of record
[Diagram labels: Network Events, Dual Load, RainStor]
“RainStor addresses data growth at the root cause.”
~ Architect, U.S. Bank
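The figures on this slide can be combined into a rough sizing check. This is only a back-of-the-envelope sketch: it assumes the 83% reduction applies directly to the 50TB/day raw volume, and it ignores HDFS replication and any other overhead, none of which the slide states. The `stored_tb` helper is a hypothetical name.

```python
RAW_TB_PER_DAY = 50   # from the slide
RETENTION_DAYS = 30   # from the slide
REDUCTION = 0.83      # from the slide; ASSUMED to apply to the raw volume
NODES = 8             # from the slide


def stored_tb(raw_tb_per_day: float, days: int, reduction: float) -> float:
    """Footprint after reduction, before any replication overhead."""
    return raw_tb_per_day * days * (1.0 - reduction)


raw_total = RAW_TB_PER_DAY * RETENTION_DAYS
compressed_total = stored_tb(RAW_TB_PER_DAY, RETENTION_DAYS, REDUCTION)
per_node = compressed_total / NODES

print(f"raw: {raw_total} TB")
print(f"stored: {compressed_total:.0f} TB")
print(f"per node: {per_node:.1f} TB")
```

Under these assumptions, 1500 TB of raw events shrinks to roughly 255 TB, or about 32 TB per node before replication. Actual numbers would depend on details the slide does not give.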

Compliant and Secure Analytics
Problem
•  Usage data must be encrypted on Hadoop
•  Avoid any query performance impact
Challenge
•  Deliver cost-effective & secure scalability
Solution
•  RainStor 15x compression vs. ORC 7x
•  Encryption with only 3% query overhead
•  Queries 3X faster than Hive
[Diagram labels: Clickstream/Usage Data, Network, Customer; Data Extract/Scrubbing; RainStor; 1.2PB, 62 Nodes Running Hortonworks 2.1]
“We keep finding new stuff we can do with RainStor! We are just getting rolling!”
~ Principal Architect, Global CSP
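Compression is quoted in this deck both as a ratio (15x, 7x, 10-40X) and as a percentage (90%+). The two are related by reduction = 1 - 1/ratio, which the short sketch below applies to the 15x versus 7x comparison on this slide. The function names are hypothetical, and the calculation assumes both ratios are measured against the same raw data.

```python
def reduction_from_ratio(ratio: float) -> float:
    """Fraction of space saved for a given compression ratio (10x -> 0.9)."""
    if ratio <= 0:
        raise ValueError("compression ratio must be positive")
    return 1.0 - 1.0 / ratio


def relative_footprint(ratio_a: float, ratio_b: float) -> float:
    """Size under ratio_a as a fraction of the size under ratio_b."""
    return ratio_b / ratio_a


rainstor_saving = reduction_from_ratio(15)  # 15x, as quoted for RainStor
orc_saving = reduction_from_ratio(7)        # 7x, as quoted for ORC

print(f"15x saves {rainstor_saving:.1%}, 7x saves {orc_saving:.1%}")
print(f"15x output is {relative_footprint(15, 7):.0%} the size of 7x output")
```

The percentages look close (about 93% versus 86%), yet the stored footprint at 15x is a little under half of the footprint at 7x, which is why ratios are the more telling figure when comparing formats.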

Application Retirement & Access
Problem
•  Hundreds of applications taking up space
Challenges
•  Needed to lower TCO
•  Free up capacity and maintain user access
Results
•  Hundreds of apps retired into RainStor
•  Users access data using BI tool of their choice
•  Administration is minimal on low cost NAS
•  Saving $800K for every 100TB stored in RainStor
[Diagram: Before and After, with RainStor]
“I installed RainStor in less than 5 minutes and was querying the data 30 minutes later.”
~ Principal Architect, U.S. Telco

Teradata Appliance for Hadoop
•  Future-proof capacity
–  2x to 8x more compressed, including ORC
•  Fast analysis (2x to 100x performance boost)
–  Mature SQL stack
-  Multiple parsers – Oracle, SQL Server, Sybase
–  Fast Hive QL, Pig, MapReduce
–  Support for BI tools
•  Security and compliance
–  Encryption
–  LDAP/AD/PAM/Kerberos/PCI
–  SQL92 users, tables, views, and data masking
–  Audit trails & logging
•  Life cycle management
–  Retention rules & expiry policies
–  Schema evolution
•  Faster time-to-value

Perceptions & Questions
Analyst: Robin Bloor

The Quality of the Data Lake
Robin Bloor, PhD

The Departure Point
A data lake is a WHOLLY NEW architectural idea

But Not Much Changed
•  Nothing changed in respect to enterprise operational discipline
•  Nothing changed in respect to service level policy
•  Nothing changed in respect to data governance (although it may have gotten more demanding)
•  Possibly the data got dirtier
•  Security became more onerous
•  Some things became more onerous
•  Data volumes increased

Hadoop: Good, Bad, Ugly
•  GOOD: scalability and parallelism, some components (like Kafka and Presto), costs
•  BAD: security, lack of system management components, some components (like Hive)
•  UGLY: Lack of stability, a servant with three masters, skills and experience, cultural issues

The Consequence
You need to make sensible COMPONENT decisions and sensible ARCHITECTURAL decisions

•  Can RainStor simply be used as a SQL-capable query-only database sitting on Hadoop? What are the gating factors?
•  How fast is data ingest? Are there any limits to how this is done?
•  What is the data compression limitation, if any? How much space would be saved over Hive or HBase?
•  Walk me through a data lake implementation.

•  Is there any Hadoop distribution that you prefer, or doesn’t it matter?
•  What if I’m not a Teradata user? Is there any downside to using RainStor?

Upcoming Topics
www.insideanalysis.com
October: DATA MANAGEMENT
November: ANALYTICS
December: INNOVATORS

THANK YOU for your ATTENTION!
Some images provided courtesy of Wikimedia Commons
