Robin Bloor and Teradata
Live Webcast on April 22, 2014
Watch the archive:
https://bloorgroup.webex.com/bloorgroup/lsr.php?RCID=2e69345c0a6a4e5a8de6fc72652e3bc6
Can you replace the data warehouse with Hadoop? Is Hadoop an ideal ETL subsystem? And what is the real magic of Hadoop? Everyone is looking to capitalize on the insights that lie in the vast pools of big data. Generating the value of that data relies heavily on several factors, especially choosing the right solution for the right context. With so many options out there, how do organizations best integrate these new big data solutions with the existing data warehouse environment?
Register for this episode of The Briefing Room to hear veteran analyst Dr. Robin Bloor as he explains where Hadoop fits into the information ecosystem. He’ll be briefed by Dan Graham of Teradata, who will offer perspective on how Hadoop can play a critical role in the analytic architecture. Bloor and Graham will interactively discuss big data in the big picture of the data center and will also seek to dispel several common misconceptions about Hadoop.
Visit InsideAnalysis.com for more information.
Hadoop and the Data Warehouse: Point/Counter Point
1. Grab some coffee and enjoy
the pre-show banter before
the top of the hour!
2. Hadoop and the Data Warehouse: Point/Counter Point
The Briefing Room
3. Twitter Tag: #briefr
The Briefing Room
Welcome
Host:
Eric Kavanagh
eric.kavanagh@bloorgroup.com
@eric_kavanagh
4. Mission
• Reveal the essential characteristics of enterprise software,
good and bad
• Provide a forum for detailed analysis of today’s innovative
technologies
• Give vendors a chance to explain their product to savvy
analysts
• Allow audience members to pose serious questions... and get
answers!
5. Topics
This Month: BIG DATA
May: DATABASE
June: ANALYTICS & MACHINE LEARNING
2014 Editorial Calendar at
www.insideanalysis.com/webcasts/the-briefing-room
7. Analyst: Robin Bloor
Robin Bloor is
Chief Analyst at
The Bloor Group
robin.bloor@bloorgroup.com
@robinbloor
8. Teradata
• Teradata is known for its analytics data solutions with a
focus on integrated data warehousing, big data analytics
and business applications
• It offers a broad suite of technology platforms and solutions
and a wide range of data management applications
• Teradata’s SQL-H allows users and applications to join
Hadoop data to the Teradata Data Warehouse and the Aster
Discovery Platform
9. Guest: Dan Graham
Dan Graham is the Technical Marketing
Director for Teradata. With over 30 years in IT,
Dan joined Teradata Corporation in 1989
where he was the senior product manager for
the DBC/1012 parallel database computer. He
then joined IBM where he wrote product plans
and launched the RS/6000 SP parallel server.
He then became Strategy Executive for IBM’s
Global Business Intelligence Solutions. As
Enterprise Systems General Manager at
Teradata, Dan was responsible for strategy,
go-to-market success, and competitive
differentiation for the Active Enterprise Data
Warehouse platform. He currently leads
Teradata’s technical marketing activities.
10. HADOOP AND
THE DATA WAREHOUSE
Point, Counterpoint
Myths and Magic
11. AGENDA
• Words and modern terminology
• Is Hadoop a data integration product?
• Is Hadoop a data warehouse?
• Hadoop – the magic
• Teradata and Hadoop
http://www.teradata.com/analyst-reports/Hadoop-and-the-Data-Warehouse-Competitive-or-Complementary/
11 Copyright Teradata
12. Our Host, the Word Smith
@RobinBloor
14. Now we have something that can provide
us “Real-time” in Hadoop
• At least most of the time
• Queries are significantly faster but not
always instantaneous
> Simple selects → A couple of seconds
> Join queries → 10s of seconds
Source: Slideshare, Real Time Interactive Queries IN HADOOP:
Big Data Warehousing Meetup, June 2013
15. Hadoop Translator
Real-time query
• Hadoop meaning
> Self-service interactive queries that run in under minutes, preferably < 10s of seconds
• BI/DW meaning
> Query responses in milliseconds
> + Advanced query prioritization
SQL
• Hadoop meaning
> Subset of ANSI 92 SQL
> Primary data types
> UDFs
• BI/DW meaning
> ANSI 2008 SQL +
> Some/all ANSI SQL 2011
> All SQL data types
> Integrity constraints, window functions, UDFs, triggers, XML
> ACID transactions (start transaction, commit, rollback)
> Geospatial, temporal
OLAP
• Hadoop meaning
> Any query < 10 seconds
• BI/DW meaning
> Subsecond multi-dimensional aggregate queries
> Roll-up, drill-down hierarchies
> MOLAP and ROLAP
See: Wikipedia
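Two of the BI/DW-side SQL features named on this slide, window functions and ACID transactions with rollback, can be shown in a few lines. This is only a sketch: SQLite (through Python's standard `sqlite3` module) is used as a convenient stand-in and is neither Teradata nor Hive, and the `sales` table and its values are made up for illustration. Window functions assume an SQLite build of 3.25 or later.

```python
import sqlite3

# In-memory database; the "sales" table and its rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("east", 10), ("east", 30), ("west", 20)],
)
conn.commit()

# Window function: running total of amount within each region.
running_totals = conn.execute(
    """
    SELECT region, amount,
           SUM(amount) OVER (PARTITION BY region ORDER BY amount) AS running
    FROM sales
    ORDER BY region, amount
    """
).fetchall()

# Transaction: the UPDATE opens an implicit transaction, and rollback()
# undoes it, so the stored amounts are unchanged afterwards.
conn.execute("UPDATE sales SET amount = 0")
conn.rollback()
total_after_rollback = conn.execute("SELECT SUM(amount) FROM sales").fetchone()[0]

print(running_totals)
print(total_after_rollback)
```

An engine limited to a subset of SQL-92 would have to express the running total as a self-join or in application code, which is the kind of performance consequence the next slide refers to.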
16. Shoop, Shoop Hadoop!
• Real-Time Query: Real-Time is really business
time. It is almost always performance critical
(otherwise why would you engineer for it?).
• SQL sophistication depends on what you want to
use it for. SQL-92 is rather primitive. There are
consequences – performance consequences.
• The appropriateness of Hadoop Interactive (OLAP)
capability is user dependent. But why would you
use Hadoop for this?
17. Current HDFS Availability & Data Integrity
• Simple design, storage fault tolerance
> Storage: Rely on the OS’s file system rather than use raw disk
> Storage Fault Tolerance: multiple replicas, active monitoring
> Single NameNode Master
– Persistent state: multiple copies + checkpoints
– Restart on failure
• How well did it work?
> Lost 19 out of 329 Million blocks on 10 clusters with 20K
nodes in 2009
– 7-9’s of reliability
– Fixed in 20 and 21.
> 18-month study: 22 failures on 25 clusters, 0.58 failures
per year per cluster
– Only 8 would have benefitted from HA failover! (0.23 failures
per cluster-year)
> NN is very robust and can take a lot of abuse
– NN is resilient against overload caused by misbehaving apps
Source: Slideshare, NameNode HA, 2011
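The reliability figures on this slide can be checked with simple arithmetic. The sketch below assumes the study covered exactly 18 months (1.5 years) across all 25 clusters; the variable names are illustrative. Under that assumption the overall failure rate comes out near the quoted 0.58, while the HA-relevant rate comes out near 0.21 rather than the quoted 0.23, so the source presumably used a slightly different exposure period or rounding that the slide does not state.

```python
import math

# Block loss: 19 blocks lost out of 329 million (figures from the slide).
lost_blocks = 19
total_blocks = 329_000_000
loss_fraction = lost_blocks / total_blocks

# "Number of nines" = how many leading 9s the survival fraction has.
nines = math.floor(-math.log10(loss_fraction))

# NameNode failures: 22 in total, 8 of which HA failover would have helped.
clusters = 25
study_years = 1.5  # assumption: "18 months" taken as exactly 1.5 years
cluster_years = clusters * study_years

rate_all = 22 / cluster_years
rate_ha_relevant = 8 / cluster_years

print(f"loss fraction: {loss_fraction:.2e}, nines: {nines}")
print(f"failures per cluster-year (all): {rate_all:.2f}")
print(f"failures per cluster-year (HA would have helped): {rate_ha_relevant:.2f}")
```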
18. Hadoop Translator
High Availability
• Hadoop meaning
> Data replication
> Name node failover
• BI/DW meaning
> Redundant access paths (network, nodes, disks)
> RAID storage, high quality hardware
> Minimized planned downtime
> No single point of failure
> HA administration tools, event alerts tracking and auto recovery
> Backups
Fault tolerant
• Hadoop meaning
> Query automatically restarts on another node without resubmission using replicated data
• BI/DW meaning
> Nonstop system (no unplanned system halt or reboot)
> Extreme hardware reliability
> 99.999% uptime
> Fault isolation and containment
> Graceful degradation
> Rolling upgrades
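To put the 99.999% uptime figure in concrete terms, an availability percentage can be converted into allowed downtime per year. This is a minimal sketch assuming a 365-day year; the function name is illustrative.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumption: non-leap year

def downtime_minutes_per_year(availability_percent):
    """Minutes of downtime per year permitted at a given availability."""
    return (1 - availability_percent / 100) * MINUTES_PER_YEAR

five_nines = downtime_minutes_per_year(99.999)
three_nines = downtime_minutes_per_year(99.9)

print(f"99.999% uptime allows about {five_nines:.2f} minutes of downtime per year")
print(f"99.9% uptime allows about {three_nines / 60:.2f} hours of downtime per year")
```

Five nines leaves a little over five minutes per year, which is why the BI/DW column pairs it with rolling upgrades and minimized planned downtime.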
19. Hadoop Falling Over!
• Hadoop was built for the recovery of large batch jobs
on large commodity grids.
• The goal was not to lose the work.
• This is really about disk failure.
• HA/FT is always configured according to workload
characteristics. Enterprise HA is best thought of
as “transactional” and OLTP, at the least, if not a
real-time event.
20. Is Hadoop a Data Integration Platform?
• Yes
> “Lots of customers doing
ETL in Hadoop”
> Data refineries
> Unstructured data
– Weblogs and sensor data
> Data Hub/Data Lake
• No
> No built-ins
– Data quality tools
– Transformations
> All do-it-yourself code
> No ETL process
management
> No metadata repository
21. Hadoop Is Not a Data Integration Solution
• Data integration requires a method for rationalizing inconsistent
semantics, which helps developers rationalize various sources of
data (depending on some of the metadata and policy capabilities
that are entirely absent from the Hadoop stack).
• Data quality is a key component of any appropriately governed
data integration project. The Hadoop stack offers no support
for this, other than the individual programmer's code, one
data element at a time, or one program at a time.
• Because Hadoop work streams are independent — and separately
programmed for specific use cases — there is no method for
relating one to another, nor for identifying or reconciling underlying
semantic differences.
29 January 2013
22. Unstructured Data in the Data Warehouse
• Facebook, Twitter, LinkedIn
• Sensor data
• XML
• Web logs
• JSON
• eMail
• Documents
• Images
• Not so much
> Audio
> Video
[Chart, 2013. Vertical axis: 0% to 20%. Categories: Social, XML, Docs, eMail, Web logs, JPGs, A/V, Sensor]
Sources: Derived from TDWI, Wikibon, Gartner, IDC
23. Hadoop For Data Integration!
• Hadoop serves a useful function as a
data reservoir.
• The revenge of the ISAM file
• Some ETL
• Some cleansing
• Some analytics
• Personally, I would want drag and drop
ETL, ELT. Those who write code
maintain code.
24. Is Hadoop a Data Warehouse?
25. Scaling the Facebook Data
Warehouse to 300 PB
At Facebook, we have unique storage scalability challenges when it comes to our
data warehouse. Our warehouse stores upwards of 300 PB of Hive data,
with an incoming daily rate of about 600 TB. In the last year, the warehouse has
seen a 3x growth in the amount of data stored. Given this growth trajectory,
storage efficiency is and will continue to be a focus for our warehouse
infrastructure.
There are many areas we are innovating in to improve storage efficiency for the
warehouse – building cold storage data centers, adopting techniques like RAID in
HDFS to reduce replication ratios (while maintaining high availability), and using
compression for data reduction before it’s written to HDFS. The most widely used
system at Facebook for large data transformations on raw logs is Hive, a query
engine based on Corona Map Reduce used for processing and creating large
tables in our data warehouse. In this post, we will focus primarily on how we
evolved the Hive storage format to compress raw data as efficiently as possible
into the on-disk data format.
https://code.facebook.com/posts/229861827208629/scaling-the-facebook-data-warehouse-to-300-pb/
April 10, 2014
26. What is a Data Warehouse?
• A data design pattern, an architecture
> Size doesn’t matter
> A perpetual evolution
• Definition: Gartner (2005) /Inmon (1992) /Wikipedia
> Subject oriented
– Detailed data + modeling of sales, inventory, finance, etc.
> Integrated logical model
– Merged data
– Consistent, standardized data formats and values
> Nonvolatile
– Data stored unmodified for long periods of time
> Time variant
– Record versioning or temporal services
> Persistent storage, not virtual, not federated
Source: Gartner: Of Data Warehouses, Operational Data Stores, Data Marts and Data 'Outhouses';
Bill Inmon, Building the Data Warehouse, 1992, Wiley and Sons
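The "time variant" property on this slide (record versioning) can be sketched as a history of versions plus an "as of" lookup. Everything here is hypothetical: the `PriceVersion` layout, the field names, and the sample prices are illustrative and do not come from the slides or from any particular product.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass(frozen=True)
class PriceVersion:
    product_id: str
    price: float
    valid_from: date           # inclusive
    valid_to: Optional[date]   # exclusive; None means "still current"

def price_as_of(history, product_id, as_of):
    """Return the version in effect for product_id on the date as_of, or None."""
    for version in history:
        if (version.product_id == product_id
                and version.valid_from <= as_of
                and (version.valid_to is None or as_of < version.valid_to)):
            return version
    return None

# Old versions are kept rather than overwritten, so past states stay queryable.
history = [
    PriceVersion("P1", 9.99, date(2013, 1, 1), date(2013, 7, 1)),
    PriceVersion("P1", 11.49, date(2013, 7, 1), None),
]

print(price_as_of(history, "P1", date(2013, 3, 15)))
print(price_as_of(history, "P1", date(2014, 4, 22)))
```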
27. Subject Areas: A Model of ‘Our’ Business
Price
history
Point of Sale
Product/Services
Inventory
Supplier
Contracts
Labor
Associate
E-Commerce
Channels
Customer
Sales
transactions
Carrier Shipment
Campaigns
Promotion
Warehouse
Each subject area has numerous large FACT tables (=big joins)
28. You Wish You Had Redundant Data!
Match keys
App  Cust_ID    First    Last      DOB       Social       Address
ERP  30391-244  William  Franks    04/12/00  563-49-1234  123 Oak, Atlanta
CRM  30391244   W.       Franks    04/12/70  563491234
SCM  30391244   Bill     Franks    04/12/70               Atlanta
XYZ  30391-244  Frank    Williams            563491234    123 Oak St. #14
ETL
Final integrated record
     Cust_ID    First    Last      DOB       Social       Address
     30391244   William  Franks    04/12/70  563491234    123 Oak St. #14
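The matching step on this slide can be sketched in a few lines. The sample rows are the four source records from the slide; the rules are assumptions for illustration only: match keys are normalized by keeping digits, and each field of the integrated record takes its most common non-empty value. Real data quality tools use much richer survivorship rules, so this sketch is not expected to reproduce every field of the slide's final record (the address in particular).

```python
import re
from collections import Counter, defaultdict

def match_key(value):
    """Normalize an identifier by keeping digits only (assumed rule)."""
    return re.sub(r"\D", "", value or "")

records = [
    {"app": "ERP", "cust_id": "30391-244", "first": "William", "last": "Franks",
     "dob": "04/12/00", "social": "563-49-1234", "address": "123 Oak, Atlanta"},
    {"app": "CRM", "cust_id": "30391244", "first": "W.", "last": "Franks",
     "dob": "04/12/70", "social": "563491234", "address": ""},
    {"app": "SCM", "cust_id": "30391244", "first": "Bill", "last": "Franks",
     "dob": "04/12/70", "social": "", "address": "Atlanta"},
    {"app": "XYZ", "cust_id": "30391-244", "first": "Frank", "last": "Williams",
     "dob": "", "social": "563491234", "address": "123 Oak St. #14"},
]

def integrate(source_records):
    """Group records by normalized Cust_ID and merge each group field by field."""
    groups = defaultdict(list)
    for rec in source_records:
        groups[match_key(rec["cust_id"])].append(rec)

    merged = {}
    for key, group in groups.items():
        out = {"cust_id": key}
        for field in ("first", "last", "dob", "social", "address"):
            values = [r[field] for r in group if r[field]]
            if field == "social":
                values = [match_key(v) for v in values]
            # Most common non-empty value wins; empty string if none present.
            out[field] = Counter(values).most_common(1)[0][0] if values else ""
        merged[key] = out
    return merged

integrated = integrate(records)
print(integrated["30391244"])
```

Majority voting is enough to recover the last name, the date of birth, and the social security number here, because the redundant copies outvote the single bad value in each column. That is the slide's point about wishing for redundant data.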
29. What is a Data Mart?
• A targeted project that will be finished
> A subset of data, not all the data
> Not for all of the people
• Often heavily denormalized
• Volatility
> Often completely reloaded
• Time variance and currency
> Can restate the data “as of” a point in time
• Virtualization option
> Can be a logical set of views, cubes
Source: Gartner: Of Data Warehouses, Operational Data Stores, Data Marts and Data 'Outhouses';
Inmon, Building the Data Warehouse, 1992, Wiley and Sons
30. Why Hadoop Is Not a Data Warehouse
31. Words Matter!
• The meaning of data warehouse is
changing:
> JSON (hierarchical capability)
> Network queries (possibly offload)
> Analytics
• The meaning of data warehouse is
extending. But it still includes
“optimization.”
• It’s no longer a data staging area, it’s
a reservoir.
35. Data Lake Benefits: The Landing Zone
• Rapid ingest
> File copy vs database load
• Temporary data
• Data not ready for the
data warehouse
• Data that never → data
warehouse
• Archives
> Alternative to magnetic tape
36. Hadoop Enables Another Data Platform
• Ad hoc projects
> One-shot complex analytics
> Hurry up, short term efforts
• Alternative analytics
> Not SQL-friendly algorithms
> Markov chains, random forest
> JPG, audio analysis
• Sandbox – hunting in the dark
> Prototyping
> Data exploration
> Trial and error new algorithms
37. What Hadoop Is For!
• Data reservoir
• Prototyping
• Analytical or BI sandboxing (data
wrangling)
• Archive
• File system API (HDFS)
38. Teradata Unified Data Architecture: System Conceptual View
[Diagram; labels as extracted]
Sources: ERP, SCM, CRM, Images, Audio and Video, Machine Logs, Text, Web and Social
Functions: Move, Manage, Access
Platforms: Data Platform, Integrated Data Warehouse, Integrated Discovery Platform
Products: Teradata Database, Hortonworks, Teradata Aster Database
Analytic tools & apps and users: Marketing, Applications, Business Intelligence, Data Mining, Math and Stats, Languages, Customers, Partners, Business Analysts, Data Scientists, Executives, Operational Systems, Frontline Workers, Engineers
39. Teradata SQL-H
• Joint R&D with Hortonworks
> Donated to Apache
• Business user query with
favorite BI tools
• Join Hadoop data to
> Teradata Data Warehouse
> Aster Discovery Platform
• Teradata 15.0
> Bi-directional SQL
> Push down filters to Hive
• Fast, secure, reliable
[Diagram: Teradata SQL-H and Aster SQL-H over Hadoop (MR, Hive, Pig, HCatalog); Hadoop Layer: HDFS; labels: Data, Data Filtering]
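The "push down filters to Hive" idea on this slide can be pictured with a toy model. This is not SQL-H syntax and not a Teradata or Hive API: the function names and sample rows are hypothetical, and plain Python lists stand in for the Hadoop side and the warehouse side. The point is only that both plans return the same join result, while the push-down plan ships fewer rows between systems.

```python
def remote_scan(rows, predicate=None):
    """Stand-in for the Hadoop side: returns the rows it would ship over the network."""
    if predicate is None:
        return list(rows)
    return [row for row in rows if predicate(row)]

def is_error(row):
    return row["status"] == 500

# Hypothetical web log rows living on the Hadoop side.
weblogs = [
    {"cust_id": 1, "status": 200},
    {"cust_id": 2, "status": 500},
    {"cust_id": 1, "status": 500},
    {"cust_id": 3, "status": 200},
]
# Hypothetical customer table living on the warehouse side.
customers = {1: "Franks", 2: "Jones"}

def join_to_customers(rows):
    """Warehouse-side join of shipped rows to the customer table."""
    return [(customers[r["cust_id"]], r["status"])
            for r in rows if r["cust_id"] in customers]

# Plan A: ship every row, then filter and join in the warehouse.
shipped_all = remote_scan(weblogs)
result_a = join_to_customers([r for r in shipped_all if is_error(r)])

# Plan B: push the filter down so only matching rows are shipped.
shipped_filtered = remote_scan(weblogs, is_error)
result_b = join_to_customers(shipped_filtered)

print(len(shipped_all), len(shipped_filtered), result_a == result_b)
```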
40. Teradata 15: Teradata QueryGrid™
[Diagram; labels as extracted]
Users: Business users, Data Scientists
IDW and Discovery: Teradata Database, Teradata Aster Database
Teradata systems: Teradata Database; Teradata Aster Database (SQL, SQL-MR, SQL-GR)
Other databases: Remote data
Languages: SAS, Perl, Python, R, Ruby, etc.
Hadoop: Push-down to Hadoop
41. Market Possibilities
• The scale-out file system will not die
(because it’s only an API)
• YARN (& Cascading) will prosper
• Hadoop will play a role in data flow
• It will never replace the EDW,
except by deception
• The struggle for a unified
architecture will continue
42. Hadoop and the Data Warehouse:
Competitive or Complementary?
http://www.teradata.com/analyst-reports/Hadoop-and-the-Data-Warehouse-Competitive-or-Complementary/