Grab some coffee and enjoy 
the pre-show banter before 
the top of the hour!
Hadoop and the Data Warehouse: Point/Counter Point 
The Briefing Room
Twitter Tag: #briefr 
The Briefing Room 
Welcome 
Host: 
Eric Kavanagh 
eric.kavanagh@bloorgroup.com 
@eric_kavanagh
Mission
• Reveal the essential characteristics of enterprise software, good and bad
• Provide a forum for detailed analysis of today’s innovative technologies
• Give vendors a chance to explain their product to savvy analysts
• Allow audience members to pose serious questions... and get answers!
Topics
This Month: BIG DATA
May: DATABASE
June: ANALYTICS & MACHINE LEARNING
2014 Editorial Calendar at www.insideanalysis.com/webcasts/the-briefing-room
Analyst: Robin Bloor
Robin Bloor is Chief Analyst at The Bloor Group
robin.bloor@bloorgroup.com
@robinbloor
Teradata
• Teradata is known for its analytics data solutions with a focus on integrated data warehousing, big data analytics and business applications
• It offers a broad suite of technology platforms and solutions and a wide range of data management applications
• Teradata’s SQL-H allows users and applications to join Hadoop data to the Teradata Data Warehouse and the Aster Discovery Platform
Guest: Dan Graham 
Dan Graham is the Technical Marketing 
Director for Teradata. With over 30 years in IT, 
Dan joined Teradata Corporation in 1989 
where he was the senior product manager for 
the DBC/1012 parallel database computer. He 
then joined IBM where he wrote product plans 
and launched the RS/6000 SP parallel server. 
He then became Strategy Executive for IBM’s 
Global Business Intelligence Solutions. As 
Enterprise Systems General Manager at 
Teradata, Dan was responsible for strategy, 
go-to-market success, and competitive 
differentiation for the Active Enterprise Data 
Warehouse platform. He currently leads 
Teradata’s technical marketing activities.
HADOOP AND 
THE DATA WAREHOUSE 
Point, Counterpoint 
Myths and Magic
AGENDA 
• Words and modern terminology 
• Is Hadoop a data integration product? 
• Is Hadoop a data warehouse? 
• Hadoop – the magic 
• Teradata and Hadoop 
http://www.teradata.com/analyst-reports/Hadoop-and-the-Data-Warehouse-Competitive-or-Complementary/ 
Copyright Teradata
Our Host, the Word Smith 
@RobinBloor 
Now we have something that can provide 
us “Real-time” in Hadoop 
• At least most of the time 
• Queries are significantly faster but not always instantaneous
> Simple selects → A couple of seconds
> Join queries → 10s of seconds
Source: Slideshare, Real Time Interactive Queries IN HADOOP: 
Big Data Warehousing Meetup, June 2013 
Hadoop Translator
Term: Real-time query
• Hadoop meaning
> Self-service interactive queries that run in under minutes, preferably < 10s of seconds
• BI/DW meaning
> Query responses in milliseconds
> Plus advanced query prioritization
Term: SQL
• Hadoop meaning
> Subset of ANSI 92 SQL
> Primary data types
> UDFs
• BI/DW meaning
> ANSI 2008 SQL +
> Some/all ANSI SQL 2011
> All SQL data types
> Integrity constraints, window functions, UDFs, triggers, XML
> ACID transactions (start transaction, commit, rollback)
> Geospatial, temporal
Term: OLAP
• Hadoop meaning
> Any query < 10 seconds
• BI/DW meaning
> Subsecond multi-dimensional aggregate queries
> Roll-up, drill-down hierarchies
> MOLAP and ROLAP
See: Wikipedia
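As a rough illustration of two items on the BI/DW side of the SQL row (integrity constraints and ACID transactions with commit and rollback), here is a minimal sketch using Python's built-in sqlite3 module. SQLite is only a stand-in chosen because it ships with Python; the account table, the transfer function and the amounts are hypothetical and do not come from the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Integrity constraint: a balance may never go negative.
conn.execute(
    "CREATE TABLE account ("
    "id INTEGER PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
with conn:  # commits the inserts as one transaction
    conn.executemany("INSERT INTO account VALUES (?, ?)", [(1, 100), (2, 50)])


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst; both updates succeed or neither does."""
    try:
        with conn:  # commit on success, rollback if an exception is raised
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE id = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE id = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint rejected the debit; the credit was rolled back too.
        return False


ok = transfer(conn, 1, 2, 30)       # fits within the balance, so it commits
failed = transfer(conn, 1, 2, 500)  # would overdraw account 1, so it rolls back
balances = dict(conn.execute("SELECT id, balance FROM account ORDER BY id"))
```

The second transfer credits account 2 before the debit fails, and the rollback undoes that credit. That all-or-nothing behaviour is what the table lists under "ACID transactions".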
Shoop, Shoop Hadoop! 
q Real-Time Query: Real-Time is really business 
time. It is almost always performance critical 
(otherwise why would you engineer for it?). 
q SQL sophistication depends on what you want to 
use it for. SQL-92 is rather primitive. There are 
consequences – performance consequences. 
q The appropriateness of Hadoop Interactive (OLAP) 
capability is user dependent. But why would you 
use Hadoop for this? 
16 Copyright Teradata
Current HDFS Availability & Data Integrity 
• Simple design, storage fault tolerance
> Storage: Rely on the OS’s file system rather than use raw disk
> Storage Fault Tolerance: multiple replicas, active monitoring 
> Single NameNode Master 
– Persistent state: multiple copies + checkpoints 
– Restart on failure 
• How well did it work? 
> Lost 19 out of 329 Million blocks on 10 clusters with 20K 
nodes in 2009 
– 7-9’s of reliability 
– Fixed in 20 and 21.
> 18-month study: 22 failures on 25 clusters - 0.58 failures
per year per cluster 
– Only 8 would have benefitted from HA failover!! (0.23 failures 
per cluster year) 
> NN is very robust and can take a lot of abuse 
– NN is resilient against overload caused by misbehaving apps 
Source: Slideshare, NameNode HA, 2011 
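The figures on this slide can be checked with a few lines of arithmetic. The sketch below assumes that "7-9's" means the fraction of blocks that survived, and that the per-cluster rates divide failures by 25 clusters over 1.5 years; the slide states neither assumption explicitly. Under those assumptions the block figure and the 0.58 rate come out as quoted, while the HA-relevant rate comes out nearer 0.21 than the quoted 0.23, so the original study may have used a different denominator.

```python
import math

# Block loss: 19 blocks lost out of 329 million.
lost_blocks = 19
total_blocks = 329_000_000
loss_fraction = lost_blocks / total_blocks

# Number of leading nines in the survival fraction (1 - loss_fraction).
nines = math.floor(-math.log10(loss_fraction))


def failures_per_cluster_year(failures, clusters, months):
    """Failures divided by total cluster-years observed."""
    cluster_years = clusters * months / 12
    return failures / cluster_years


all_failures = failures_per_cluster_year(22, 25, 18)
ha_relevant = failures_per_cluster_year(8, 25, 18)

print(f"survival fraction has {nines} nines")
print(f"all failures: {all_failures:.2f} per cluster-year")
print(f"HA-relevant failures: {ha_relevant:.2f} per cluster-year")
```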
Hadoop Translator
Term: High availability
• Hadoop meaning
> Data replication
> Name node failover
• BI/DW meaning
> Redundant access paths (network, nodes, disks)
> RAID storage, high quality hardware
> Minimized planned downtime
> No single point of failure
> HA administration tools, event alerts tracking and auto recovery
> Backups
Term: Fault tolerant
• Hadoop meaning
> Query automatically restarts on another node without resubmission using replicated data
• BI/DW meaning
> Nonstop system (no unplanned system halt or reboot)
> Extreme hardware reliability
> 99.999% uptime
> Fault isolation and containment
> Graceful degradation
> Rolling upgrades
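To put the "99.999% uptime" entry in concrete terms, the sketch below converts an uptime percentage into allowed downtime per year. It assumes a 365-day year and counts all downtime against the target, planned or not. The 99.9% comparison is added for illustration and is not from the slide.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_minutes_per_year(uptime_percent):
    """Minutes of downtime per year permitted by an uptime percentage."""
    return MINUTES_PER_YEAR * (1 - uptime_percent / 100)


five_nines = downtime_minutes_per_year(99.999)
three_nines = downtime_minutes_per_year(99.9)

print(f"99.999% uptime allows about {five_nines:.2f} minutes of downtime per year")
print(f"99.9% uptime allows about {three_nines:.1f} minutes of downtime per year")
```

Five nines leaves a little over five minutes a year, which is why the BI/DW column pairs it with rolling upgrades and minimized planned downtime.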
Hadoop Falling Over! 
q Hadoop was built for the recovery of large batch 
on large commodity grids. 
q The goal was not to lose the work 
q This is really about disk failure 
q HA/FT is always configured according to workload 
characteristics. Enterprise HA is best thought of 
as “transactional” and OLTP, at the least, if not a 
real-time event. 
19 Copyright Teradata
Is Hadoop a Data Integration Platform? 
• Yes 
> “Lots of customers doing 
ETL in Hadoop” 
> Data refineries 
> Unstructured data 
– Weblogs and sensor data 
> Data Hub/Data Lake 
• No
> No built-ins
– Data quality tools
– Transformations
> All do-it-yourself code
> No ETL process management
> No metadata repository
Hadoop Is Not a Data Integration Solution 
• Data integration requires a method for rationalizing inconsistent 
semantics, which helps developers rationalize various sources of 
data (depending on some of the metadata and policy capabilities 
that are entirely absent from the Hadoop stack). 
• Data quality is a key component of any appropriately governed 
data integration project. The Hadoop stack offers no support 
for this, other than the individual programmer's code, one 
data element at a time, or one program at a time. 
• Because Hadoop work streams are independent — and separately 
programmed for specific use cases — there is no method for 
relating one to another, nor for identifying or reconciling underlying 
semantic differences. 
29 January 2013 
Unstructured Data in the Data Warehouse 
• Facebook, Twitter, LinkedIn 
• Sensor data 
• XML 
• Web logs 
• JSON 
• eMail 
• Documents 
• Images 
• Not so much 
> Audio 
> Video 
[Bar chart, 2013: percentage axis from 0% to 20%. Categories: Social, XML, Docs, eMail, Web logs, JPGs, A/V, Sensor]
Sources: Derived from TDWI, Wikibon, Gartner, IDC
Hadoop For Data Integration! 
q Hadoop serves a useful function as a 
data reservoir. 
q The revenge of the ISAM file 
q Some ETL 
q Some cleansing 
q Some analytics 
q Personally, I would want drag and drop 
ETL, ELT. Those who write code 
maintain code. 
23 Copyright Teradata
Is Hadoop a Data Warehouse? 
Scaling the Facebook Data 
Warehouse to 300 PB 
At Facebook, we have unique storage scalability challenges when it comes to our 
data warehouse. Our warehouse stores upwards of 300 PB of Hive data, 
with an incoming daily rate of about 600 TB. In the last year, the warehouse has 
seen a 3x growth in the amount of data stored. Given this growth trajectory, 
storage efficiency is and will continue to be a focus for our warehouse 
infrastructure. 
There are many areas we are innovating in to improve storage efficiency for the 
warehouse – building cold storage data centers, adopting techniques like RAID in 
HDFS to reduce replication ratios (while maintaining high availability), and using 
compression for data reduction before it’s written to HDFS. The most widely used 
system at Facebook for large data transformations on raw logs is Hive, a query 
engine based on Corona Map Reduce used for processing and creating large 
tables in our data warehouse. In this post, we will focus primarily on how we 
evolved the Hive storage format to compress raw data as efficiently as possible 
into the on-disk data format. 
https://code.facebook.com/posts/229861827208629/scaling-the-facebook-data-warehouse-to-300-pb/ 
April 10, 2014
What is a Data Warehouse? 
• A data design pattern, an architecture 
> Size doesn’t matter 
> A perpetual evolution 
• Definition: Gartner (2005) /Inmon (1992) /Wikipedia 
> Subject oriented 
– Detailed data + modeling of sales, inventory, finance, etc. 
> Integrated logical model 
– Merged data 
– Consistent, standardized data formats and values 
> Nonvolatile 
– Data stored unmodified for long periods of time 
> Time variant 
– Record versioning or temporal services 
> Persistent storage, not virtual, not federated 
Source: Gartner: Of Data Warehouses, Operational Data Stores, Data Marts and Data 'Outhouses';
Bill Inmon, Building the Data Warehouse, 1992, Wiley and Sons
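The "nonvolatile" and "time variant" properties can be sketched together: versions are appended rather than overwritten, and a query asks which version was in effect on a given date. The price history, dates and function name below are hypothetical, chosen only to show the idea of record versioning.

```python
from bisect import bisect_right
from datetime import date

# Hypothetical price history for one product: (valid_from, price), sorted by date.
# Nonvolatile: a new price appends a row; older rows are never modified.
PRICE_HISTORY = [
    (date(2013, 1, 1), 9.99),
    (date(2013, 7, 1), 10.49),
    (date(2014, 2, 15), 11.25),
]


def price_as_of(history, when):
    """Return the price in effect on `when`, or None if before the first version."""
    starts = [valid_from for valid_from, _ in history]
    idx = bisect_right(starts, when)  # number of versions starting on or before `when`
    if idx == 0:
        return None
    return history[idx - 1][1]


mid_2013 = price_as_of(PRICE_HISTORY, date(2013, 6, 30))
switch_day = price_as_of(PRICE_HISTORY, date(2013, 7, 1))
before_history = price_as_of(PRICE_HISTORY, date(2012, 12, 31))
```

Because old versions stay in place, the same lookup returns the same answer for a past date no matter how many later versions are added.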
Subject Areas: A Model of ‘Our’ Business 
[Diagram of subject areas: Price history, Point of Sale, Product/Services, Inventory, Supplier, Contracts, Labor, Associate, E-Commerce, Channels, Customer, Sales transactions, Carrier Shipment, Campaigns, Promotion, Warehouse]
Each subject area has numerous large FACT tables (= big joins)
You Wish You Had Redundant Data!
Match keys
App  Cust_ID    First    Last      DOB       Social       Address
ERP  30391-244  William  Franks    04/12/00  563-49-1234  123 Oak, Atlanta
CRM  30391244   W.       Franks    04/12/70  563491234    (blank)
SCM  30391244   Bill     Franks    04/12/70  (blank)      Atlanta
XYZ  30391-244  Frank    Williams  (blank)   563491234    123 Oak St. #14
ETL
Final integrated record
     Cust_ID    First    Last      DOB       Social       Address
     30391244   William  Franks    04/12/70  563491234    123 Oak St. #14
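A minimal sketch of the ETL step this slide implies, using the four source rows shown. It assumes Cust_ID and Social are the match keys and normalizes them by stripping punctuation, then keeps a field value only when a strict majority of sources agree. The majority rule is an assumption for illustration, not the method used to build the slide's final record.

```python
from collections import Counter

FIELDS = ["cust_id", "first", "last", "dob", "social", "address"]

# Source rows as shown on the slide; None marks a blank cell.
ROWS = [
    {"app": "ERP", "cust_id": "30391-244", "first": "William", "last": "Franks",
     "dob": "04/12/00", "social": "563-49-1234", "address": "123 Oak, Atlanta"},
    {"app": "CRM", "cust_id": "30391244", "first": "W.", "last": "Franks",
     "dob": "04/12/70", "social": "563491234", "address": None},
    {"app": "SCM", "cust_id": "30391244", "first": "Bill", "last": "Franks",
     "dob": "04/12/70", "social": None, "address": "Atlanta"},
    {"app": "XYZ", "cust_id": "30391-244", "first": "Frank", "last": "Williams",
     "dob": None, "social": "563491234", "address": "123 Oak St. #14"},
]


def normalize_key(value):
    """Strip formatting so '30391-244' and '30391244' compare equal."""
    if value is None:
        return None
    return "".join(ch for ch in value if ch.isdigit())


def majority(values):
    """Return the strictly most common non-blank value, or None on a tie."""
    counts = Counter(v for v in values if v is not None).most_common(2)
    if not counts:
        return None
    if len(counts) == 2 and counts[0][1] == counts[1][1]:
        return None  # no clear winner; needs a survivorship rule or a human
    return counts[0][0]


def integrate(rows):
    """Normalize the match keys, then vote field by field."""
    cleaned = []
    for row in rows:
        r = dict(row)
        r["cust_id"] = normalize_key(r["cust_id"])
        r["social"] = normalize_key(r["social"])
        cleaned.append(r)
    return {f: majority([r[f] for r in cleaned]) for f in FIELDS}


merged = integrate(ROWS)
```

With these rows the vote settles Cust_ID, Last, DOB and Social to the values in the slide's final record. First name and address come back unresolved, because every source disagrees. That gap is where data quality tooling or explicit survivorship rules earn their keep.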
What is a Data Mart? 
• A targeted project that will be finished 
> A subset of data, not all the data 
> Not for all of the people 
• Often heavily denormalized 
• Volatility 
> Often completely reloaded 
• Time variance and currency 
> Can restate the data “as of” a point in time 
• Virtualization option 
> Can be a logical set of views, cubes 
Source: Gartner: Of Data Warehouses, Operational Data Stores, Data Marts and Data 'Outhouses';
Inmon, Building the Data Warehouse, 1992, Wiley and Sons
Why Hadoop Is Not a Data Warehouse 
Words Matter! 
q The meaning of data warehouse is 
changing: 
q JSON (hierarchical capability) 
q Network queries (possibly offload) 
q Analytics 
q The meaning of data warehouse is 
extending. But it still includes 
“optimization.” 
q It’s no longer a data staging area, it’s 
a reservoir. 
31 Copyright Teradata
HADOOP MAGIC 
Where Hadoop Excels
What is Hadoop? 
• Apache Hadoop framework 
> Hadoop Common 
– Libraries and utilities 
> Hadoop Distributed File System (HDFS) 
> Hadoop YARN 
> Hadoop MapReduce 
• Looks like Hadoop too 
> Ambari, Avro, Cassandra, Chukwa, HBase, Hive, Mahout, Pig, 
ZooKeeper, Oozie, Sqoop, Tez 
• SQL on Hadoop 
> Presto, Drill, Shark, Impala, Hawq, JAQL
• New Hadoop modules or replacements
> GPFS, Mesos, Spark, Storm, Accumulo, Sentry, Falcon, Knox, Whirr, Tachyon, SOLR, Lucene
Hadoop 2.0
Applications Run Natively IN Hadoop
• INTERACTIVE (Tez)
• STREAMING (Storm)
• BATCH (MapReduce)
• GRAPH (Giraph)
• IN-MEMORY (Spark)
• HPC MPI (OpenMPI)
• ONLINE (HBase)
• OTHER (ex. Search)
YARN (Cluster Resource Management)
HDFS2 (Redundant, Reliable Storage)
Data Lake Benefits: The Landing Zone 
• Rapid ingest 
> File copy vs database load 
• Temporary data 
• Data not ready for the 
data warehouse 
• Data that never → data warehouse
• Archives
> Alternative to magnetic tape
Hadoop Enables Another Data Platform 
• Ad hoc projects 
> One-shot complex analytics 
> Hurry up, short term efforts 
• Alternative analytics 
> Not SQL-friendly algorithms 
> Markov chains, random forest 
> JPG, audio analysis 
• Sandbox – hunting in the dark 
> Prototyping 
> Data exploration 
> Trial and error new algorithms 
What Hadoop Is For! 
q Data reservoir 
q Prototyping 
q Analytical or BI sandboxing (data 
wrangling) 
q Archive 
q File system API (HDFS) 
37 Copyright Teradata
TERADATA UNIFIED DATA ARCHITECTURE
System Conceptual View
[Diagram labels]
• SOURCES: ERP, SCM, CRM, Images, Audio and Video, Machine Logs, Text, Web and Social
• MOVE, MANAGE, ACCESS
• Platforms: DATA PLATFORM, INTEGRATED DATA WAREHOUSE, INTEGRATED DISCOVERY PLATFORM
• Products: TERADATA DATABASE, HORTONWORKS, TERADATA ASTER DATABASE
• ANALYTIC TOOLS & APPS: Marketing Applications, Business Intelligence, Data Mining, Math and Stats, Languages
• USERS: Marketing, Executives, Operational Systems, Frontline Workers, Customers, Partners, Engineers, Data Scientists, Business Analysts
Teradata SQL-H
• Joint R&D with Hortonworks
> Donated to Apache
• Business user query with favorite BI tools
• Join Hadoop data to
> Teradata Data Warehouse
> Aster Discovery Platform
• Teradata 15.0
> Bi-directional SQL
> Push down filters to Hive
• Fast, secure, reliable
[Diagram labels: Teradata SQL-H, Aster SQL-H; Hadoop: MR, Hive, Pig, HCatalog; Hadoop Layer: HDFS; Data; Data Filtering]
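The "push down filters" bullet describes a general technique: run the filter where the data lives so that only matching rows cross the network. The sketch below is a toy model of that idea in plain Python. The remote table, the function names and the rows are hypothetical, and none of it represents the actual SQL-H, Hive or Teradata interfaces.

```python
# Hypothetical stand-in for a remote Hadoop table: (customer_id, country, amount).
REMOTE_ROWS = [
    (1, "US", 120.0),
    (2, "DE", 75.5),
    (3, "US", 310.0),
    (4, "FR", 42.0),
    (5, "US", 18.25),
]


def remote_scan(predicate=None):
    """Simulate the remote side: apply the predicate (if any) before shipping rows."""
    if predicate is None:
        return list(REMOTE_ROWS)
    return [row for row in REMOTE_ROWS if predicate(row)]


def is_us(row):
    """Filter condition: keep only rows whose country is US."""
    return row[1] == "US"


# Without pushdown: every row is shipped, then the caller filters locally.
shipped_all = remote_scan()
local_result = [row for row in shipped_all if is_us(row)]

# With pushdown: the filter runs remotely, so only matching rows are shipped.
shipped_filtered = remote_scan(is_us)

rows_saved = len(shipped_all) - len(shipped_filtered)
```

Both paths return the same rows. The difference is how many rows travel, which matters far more at warehouse scale than in this five-row example.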
Teradata 15: Teradata QueryGrid™
[Diagram labels]
• Users: Business users, Data Scientists
• IDW: TERADATA DATABASE
• Discovery: TERADATA ASTER DATABASE
• TERADATA DATABASE: Teradata Systems
• TERADATA ASTER DATABASE: SQL, SQL-MR, SQL-GR
• HADOOP: Push-down to Hadoop
• OTHER DATABASES: Remote Data
• LANGUAGES: SAS, Perl, Python, R, Ruby, etc.
Market Possibilities 
q The scale-out file system will not die 
(because it’s only an API) 
q YARN (& Cascading) will prosper 
q Hadoop will play a role in data flow 
q It will never replace the EDW, 
except by deception 
q The struggle for a unified 
architecture will continue 
41 Copyright Teradata
Hadoop and the Data Warehouse: 
Competitive or Complementary? 
http://www.teradata.com/analyst-reports/Hadoop-and-the-Data-Warehouse-Competitive-or-Complementary/ 
Upcoming Topics
This Month: BIG DATA
May: DATABASE
June: ANALYTICS & MACHINE LEARNING
www.insideanalysis.com/webcasts/the-briefing-room
2014 Editorial Calendar at www.insideanalysis.com
THANK YOU for your ATTENTION!

A Year of the Servo Reboot: Where Are We Now?
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 

Hadoop and the Data Warehouse: Point/Counter Point

  • 1. Grab some coffee and enjoy the pre-show banter before the top of the hour!
  • 2. Hadoop and the Data Warehouse: Point/Counter Point The Briefing Room
  • 3. Twitter Tag: #briefr The Briefing Room Welcome Host: Eric Kavanagh eric.kavanagh@bloorgroup.com @eric_kavanagh
  • 4. The Briefing Room Mission
      • Reveal the essential characteristics of enterprise software, good and bad
      • Provide a forum for detailed analysis of today’s innovative technologies
      • Give vendors a chance to explain their product to savvy analysts
      • Allow audience members to pose serious questions... and get answers!
  • 5. Topics. This Month: BIG DATA. May: DATABASE. June: ANALYTICS & MACHINE LEARNING. 2014 Editorial Calendar at www.insideanalysis.com/webcasts/the-briefing-room
  • 7. Analyst: Robin Bloor. Robin Bloor is Chief Analyst at The Bloor Group. robin.bloor@bloorgroup.com @robinbloor
  • 8. Teradata
      • Teradata is known for its analytics data solutions with a focus on integrated data warehousing, big data analytics and business applications
      • It offers a broad suite of technology platforms and solutions and a wide range of data management applications
      • Teradata’s SQL-H allows users and applications to join Hadoop data to the Teradata Data Warehouse and the Aster Discovery Platform
  • 9. Guest: Dan Graham. Dan Graham is the Technical Marketing Director for Teradata. With over 30 years in IT, Dan joined Teradata Corporation in 1989, where he was the senior product manager for the DBC/1012 parallel database computer. He then joined IBM, where he wrote product plans and launched the RS/6000 SP parallel server. He then became Strategy Executive for IBM’s Global Business Intelligence Solutions. As Enterprise Systems General Manager at Teradata, Dan was responsible for strategy, go-to-market success, and competitive differentiation for the Active Enterprise Data Warehouse platform. He currently leads Teradata’s technical marketing activities.
  • 10. HADOOP AND THE DATA WAREHOUSE Point, Counterpoint Myths and Magic
  • 11. AGENDA • Words and modern terminology • Is Hadoop a data integration product? • Is Hadoop a data warehouse? • Hadoop – the magic • Teradata and Hadoop http://www.teradata.com/analyst-reports/Hadoop-and-the-Data-Warehouse-Competitive-or-Complementary/ 11 Copyright Teradata
  • 12. Our Host, the Word Smith @RobinBloor
  • 14. Now we have something that can provide us “Real-time” in Hadoop
      • At least most of the time
      • Queries are significantly faster but not always instantaneous
          > Simple selects → A couple of seconds
          > Join queries → 10s of seconds
      Source: Slideshare, Real Time Interactive Queries IN HADOOP: Big Data Warehousing Meetup, June 2013
  • 15. Hadoop Translator (See: Wikipedia)
      Term: Real-time query
          Hadoop meaning:
              • Self-service interactive queries that run in under minutes, preferably < 10s of seconds
          BI/DW meaning:
              • Query responses in milliseconds
              • + Advanced query prioritization
      Term: SQL
          Hadoop meaning:
              • Subset of ANSI 92 SQL
              • Primary data types
              • UDFs
          BI/DW meaning:
              • ANSI 2008 SQL +
              • Some/all ANSI SQL 2011
              • All SQL data types
              • Integrity constraints, window functions, UDFs, triggers, XML
              • ACID transactions (start transaction, commit, rollback)
              • Geospatial, temporal
      Term: OLAP
          Hadoop meaning:
              • Any query < 10 seconds
          BI/DW meaning:
              • Subsecond multi-dimensional aggregate queries
              • Roll-up, drill-down hierarchies
              • MOLAP and ROLAP
  • 16. Shoop, Shoop Hadoop!
      • Real-Time Query: Real-Time is really business time. It is almost always performance critical (otherwise why would you engineer for it?).
      • SQL sophistication depends on what you want to use it for. SQL-92 is rather primitive. There are consequences – performance consequences.
      • The appropriateness of Hadoop Interactive (OLAP) capability is user dependent. But why would you use Hadoop for this?
  • 17. Current HDFS Availability & Data Integrity
      • Simple design, storage fault tolerance
          > Storage: Rely on the OS’s file system rather than use raw disk
          > Storage Fault Tolerance: multiple replicas, active monitoring
          > Single NameNode Master
              – Persistent state: multiple copies + checkpoints
              – Restart on failure
      • How well did it work?
          > Lost 19 out of 329 million blocks on 10 clusters with 20K nodes in 2009
              – 7-9’s of reliability
              – Fixed in 20 and 21.
          > 18-month study: 22 failures on 25 clusters, 0.58 failures per year per cluster
              – Only 8 would have benefitted from HA failover!! (0.23 failures per cluster year)
          > NN is very robust and can take a lot of abuse
              – NN is resilient against overload caused by misbehaving apps
      Source: Slideshare, NameNode HA, 2011
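The reliability figures on slide 17 can be checked with a little arithmetic. The sketch below is only an illustration: the variable names are made up here, and it assumes the failure rate is simply failures divided by cluster-years (25 clusters over 1.5 years). On that assumption the "7-9's" and roughly 0.58 figures come out as quoted, while the HA-relevant subset works out closer to 0.21 than the slide's 0.23, so the slide may be using a slightly different denominator.

```python
import math

# Figures quoted on the slide (2009 block-loss data and the 18-month failure study).
lost_blocks = 19
total_blocks = 329_000_000
failures_total = 22
failures_ha_relevant = 8
clusters = 25
study_years = 1.5  # 18 months

# Fraction of blocks lost, and the number of "nines" of block durability.
loss_fraction = lost_blocks / total_blocks
nines = -math.log10(loss_fraction)

# Assumed definition: failures per cluster-year = failures / (clusters * years).
cluster_years = clusters * study_years
rate_all = failures_total / cluster_years
rate_ha = failures_ha_relevant / cluster_years

print(f"block loss fraction: {loss_fraction:.2e} (about {int(nines)} nines)")
print(f"failures per cluster-year, all causes: {rate_all:.2f}")
print(f"failures per cluster-year, HA-relevant: {rate_ha:.2f}")
```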
  • 18. Hadoop Translator
      Term: High Availability
          Hadoop meaning:
              • Data replication
              • Name node fail over
          BI/DW meaning:
              • Redundant access paths (network, nodes, disks)
              • RAID storage, high quality hardware
              • Minimized planned downtime
              • No single point of failure
              • HA administration tools, event alerts tracking and auto recovery
              • Backups
      Term: Fault tolerant
          Hadoop meaning:
              • Query automatically restarts on another node without resubmission using replicated data
          BI/DW meaning:
              • Nonstop system (no unplanned system halt or reboot)
              • Extreme hardware reliability
              • 99.999% uptime
              • Fault isolation and containment
              • Graceful degradation
              • Rolling upgrades
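For a sense of scale, the 99.999% uptime entry on slide 18 can be converted into an annual downtime budget. This is a minimal sketch, assuming a 365-day year and counting all downtime against the target; the function name is hypothetical.

```python
def downtime_minutes_per_year(availability_percent, days_per_year=365):
    """Minutes of downtime per year allowed by a given availability percentage."""
    minutes_per_year = days_per_year * 24 * 60
    return minutes_per_year * (1 - availability_percent / 100)


# Compare a few common availability targets, including the slide's "five nines".
for target in (99.9, 99.99, 99.999):
    print(f"{target}% uptime allows {downtime_minutes_per_year(target):.2f} minutes of downtime per year")
```

Under these assumptions, five nines leaves a little over five minutes of downtime per year.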
  • 19. Hadoop Falling Over!
      • Hadoop was built for the recovery of large batch on large commodity grids.
      • The goal was not to lose the work.
      • This is really about disk failure.
      • HA/FT is always configured according to workload characteristics. Enterprise HA is best thought of as “transactional” and OLTP, at the least, if not a real-time event.
  • 20. Is Hadoop a Data Integration Platform?
      • Yes
          > “Lots of customers doing ETL in Hadoop”
          > Data refineries
          > Unstructured data
              – Weblogs and sensor data
          > Data Hub/Data Lake
      • No
          > No built-ins
              – Data quality tools
              – Transformations
          > All do-it-yourself code
          > No ETL process management
          > No metadata repository
  • 21. Hadoop Is Not a Data Integration Solution
      • Data integration requires a method for rationalizing inconsistent semantics, which helps developers rationalize various sources of data (depending on some of the metadata and policy capabilities that are entirely absent from the Hadoop stack).
      • Data quality is a key component of any appropriately governed data integration project. The Hadoop stack offers no support for this, other than the individual programmer's code, one data element at a time, or one program at a time.
      • Because Hadoop work streams are independent, and separately programmed for specific use cases, there is no method for relating one to another, nor for identifying or reconciling underlying semantic differences.
      29 January 2013
  • 22. Unstructured Data in the Data Warehouse
      • Facebook, Twitter, LinkedIn
      • Sensor data
      • XML
      • Web logs
      • JSON
      • eMail
      • Documents
      • Images
      • Not so much
          > Audio
          > Video
      [Chart, 2013, scale 0% to 20%; categories: Social, XML, Docs, eMail, Web logs, JPGs, A/V, Sensor]
      Sources: Derived from TDWI, Wikibon, Gartner, IDC
  • 23. Hadoop For Data Integration!
      • Hadoop serves a useful function as a data reservoir.
      • The revenge of the ISAM file
      • Some ETL
      • Some cleansing
      • Some analytics
      • Personally, I would want drag and drop ETL, ELT. Those who write code maintain code.
  • 24. Is Hadoop a Data Warehouse?
  • 25. Scaling the Facebook Data Warehouse to 300 PB. At Facebook, we have unique storage scalability challenges when it comes to our data warehouse. Our warehouse stores upwards of 300 PB of Hive data, with an incoming daily rate of about 600 TB. In the last year, the warehouse has seen a 3x growth in the amount of data stored. Given this growth trajectory, storage efficiency is and will continue to be a focus for our warehouse infrastructure. There are many areas we are innovating in to improve storage efficiency for the warehouse – building cold storage data centers, adopting techniques like RAID in HDFS to reduce replication ratios (while maintaining high availability), and using compression for data reduction before it’s written to HDFS. The most widely used system at Facebook for large data transformations on raw logs is Hive, a query engine based on Corona Map Reduce used for processing and creating large tables in our data warehouse. In this post, we will focus primarily on how we evolved the Hive storage format to compress raw data as efficiently as possible into the on-disk data format.
      Source: https://code.facebook.com/posts/229861827208629/scaling-the-facebook-data-warehouse-to-300-pb/ (April 10, 2014)
  • 26. What is a Data Warehouse?
      • A data design pattern, an architecture
          > Size doesn’t matter
          > A perpetual evolution
      • Definition: Gartner (2005) / Inmon (1992) / Wikipedia
          > Subject oriented
              – Detailed data + modeling of sales, inventory, finance, etc.
          > Integrated logical model
              – Merged data
              – Consistent, standardized data formats and values
          > Nonvolatile
              – Data stored unmodified for long periods of time
          > Time variant
              – Record versioning or temporal services
          > Persistent storage, not virtual, not federated
      Source: Gartner: Of Data Warehouses, Operational Data Stores, Data Marts and Data 'Outhouses'; Bill Inmon, Building the Data Warehouse, 1992, Wiley and Sons
  • 27. Subject Areas: A Model of ‘Our’ Business. Price history Point of Sale Product/Services Inventory Supplier Contracts Labor Associate E-Commerce Channels Customer Sales transactions Carrier Shipment Campaigns Promotion Warehouse. Each subject area has numerous large FACT tables (= big joins).
  • 28. You Wish You Had Redundant Data!
      Match keys across source applications:

      App  Cust_ID    First    Last      DOB       Social       Address
      ERP  30391-244  William  Franks    04/12/00  563-49-1234  123 Oak, Atlanta
      CRM  30391244   W.       Franks    04/12/70  563491234
      SCM  30391244   Bill     Franks    04/12/70               Atlanta
      XYZ  30391-244  Frank    Williams            563491234    123 Oak St. #14

      Final integrated record (after ETL):

           Cust_ID    First    Last      DOB       Social       Address
           30391244   William  Franks    04/12/70  563491234    123 Oak St. #14
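The match-key merge on slide 28 can be sketched in a few lines. This is an illustration only, not the ETL logic behind the slide: the field names, the digit-only key normalization, and the "most frequent value wins, ties go to the first source seen" rule are all assumptions made here. The rule recovers the slide's key, last name, date of birth and social number. Fields where every source disagrees, such as the address, need real standardization rules that this sketch does not attempt.

```python
from collections import Counter, defaultdict

# Source records transcribed from the slide; empty strings mark missing values.
RECORDS = [
    {"app": "ERP", "cust_id": "30391-244", "first": "William", "last": "Franks",
     "dob": "04/12/00", "social": "563-49-1234", "address": "123 Oak, Atlanta"},
    {"app": "CRM", "cust_id": "30391244", "first": "W.", "last": "Franks",
     "dob": "04/12/70", "social": "563491234", "address": ""},
    {"app": "SCM", "cust_id": "30391244", "first": "Bill", "last": "Franks",
     "dob": "04/12/70", "social": "", "address": "Atlanta"},
    {"app": "XYZ", "cust_id": "30391-244", "first": "Frank", "last": "Williams",
     "dob": "", "social": "563491234", "address": "123 Oak St. #14"},
]


def digits_only(value):
    """Drop punctuation so '30391-244' and '30391244' compare equal."""
    return "".join(ch for ch in value if ch.isdigit())


def most_common_value(values):
    """Most frequent non-empty value; ties go to the value seen first."""
    counts = Counter(v for v in values if v)
    return counts.most_common(1)[0][0] if counts else ""


def integrate(records):
    """Group records by normalized customer key and vote on each field."""
    groups = defaultdict(list)
    for rec in records:
        groups[digits_only(rec["cust_id"])].append(rec)

    merged = []
    for key, recs in groups.items():
        merged.append({
            "cust_id": key,
            "first": most_common_value(r["first"] for r in recs),
            "last": most_common_value(r["last"] for r in recs),
            "dob": most_common_value(r["dob"] for r in recs),
            "social": most_common_value(digits_only(r["social"]) for r in recs),
            "address": most_common_value(r["address"] for r in recs),
        })
    return merged


result = integrate(RECORDS)
print(result)
```

A production pipeline would typically replace the voting rule with per-field survivorship rules (trusted source per attribute, address standardization, name matching), which is the kind of built-in tooling the surrounding slides argue Hadoop lacks.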
  • 29. What is a Data Mart?
      • A targeted project that will be finished
          > A subset of data, not all the data
          > Not for all of the people
      • Often heavily denormalized
      • Volatility
          > Often completely reloaded
      • Time variance and currency
          > Can restate the data “as of” a point in time
      • Virtualization option
          > Can be a logical set of views, cubes
      Source: Gartner: Of Data Warehouses, Operational Data Stores, Data Marts and Data 'Outhouses'; Inmon, Building the Data Warehouse, 1992, Wiley and Sons
  • 30. Why Hadoop Is Not a Data Warehouse
  • 31. Words Matter!
      • The meaning of data warehouse is changing:
          > JSON (hierarchical capability)
          > Network queries (possibly offload)
          > Analytics
      • The meaning of data warehouse is extending. But it still includes “optimization.”
      • It’s no longer a data staging area, it’s a reservoir.
  • 32. HADOOP MAGIC Where Hadoop Excels
  • 33. What is Hadoop?
      • Apache Hadoop framework
          > Hadoop Common – Libraries and utilities
          > Hadoop Distributed File System (HDFS)
          > Hadoop YARN
          > Hadoop MapReduce
      • Looks like Hadoop too
          > Ambari, Avro, Cassandra, Chukwa, HBase, Hive, Mahout, Pig, ZooKeeper, Oozie, Sqoop, Tez
      • SQL on Hadoop
          > Presto, Drill, Shark, Impala, Hawq, JAQL
      • New Hadoop modules or replacements
          > GPFS, Mesos, Spark, Storm, Accumulo, Sentry, Falcon, Knox, Whirr, Sentry, Tachyon, SOLR, Lucene
  • 34. Hadoop 2.0: Applications Run Natively IN Hadoop
      • Application engines: BATCH (MapReduce), INTERACTIVE (Tez), STREAMING (Storm), GRAPH (Giraph), IN-MEMORY (Spark), HPC MPI (OpenMPI), ONLINE (HBase), OTHER (ex. Search)
      • YARN (Cluster Resource Management)
      • HDFS2 (Redundant, Reliable Storage)
  • 35. Data Lake Benefits: The Landing Zone
      • Rapid ingest
          > File copy vs database load
      • Temporary data
      • Data not ready for the data warehouse
      • Data that never → data warehouse
      • Archives
          > Alternative to magnetic tape
  • 36. Hadoop Enables Another Data Platform
      • Ad hoc projects
          > One-shot complex analytics
          > Hurry up, short term efforts
      • Alternative analytics
          > Not SQL-friendly algorithms
          > Markov chains, random forest
          > JPG, audio analysis
      • Sandbox – hunting in the dark
          > Prototyping
          > Data exploration
          > Trial and error new algorithms
  • 37. What Hadoop Is For!
      • Data reservoir
      • Prototyping
      • Analytical or BI sandboxing (data wrangling)
      • Archive
      • File system API (HDFS)
  • 38. TERADATA UNIFIED DATA ARCHITECTURE: System Conceptual View
      [Diagram]
      • SOURCES: ERP, SCM, CRM, Images, Audio and Video, Machine Logs, Text, Web and Social
      • Platforms: DATA PLATFORM, INTEGRATED DATA WAREHOUSE, INTEGRATED DISCOVERY PLATFORM
      • Functions: MOVE, MANAGE, ACCESS
      • Product labels: TERADATA DATABASE, HORTONWORKS, TERADATA DATABASE, TERADATA ASTER DATABASE
      • ANALYTIC TOOLS & APPS: Marketing Applications, Business Intelligence, Data Mining, Math and Stats, Languages
      • USERS: Customers, Partners, Business Analysts, Data Scientists, Marketing Executives, Operational Systems, Frontline Workers, Engineers
  • 39. Teradata SQL-H
      • Joint R&D with Hortonworks
          > Donated to Apache
      • Business user query with favorite BI tools
      • Join Hadoop data to
          > Teradata Data Warehouse
          > Aster Discovery Platform
      • Teradata 15.0
          > Bi-directional SQL
          > Push down filters to Hive
      • Fast, secure, reliable
      [Diagram: Aster SQL-H; Hadoop MR, Hive, Pig, HCatalog; Hadoop Layer: HDFS; Data; Data Filtering]
  • 40. Teradata 15: Teradata QueryGrid™
      [Diagram]
      • Users: Business users, Data Scientists
      • Teradata Systems: TERADATA DATABASE (IDW), TERADATA ASTER DATABASE (Discovery; SQL, SQL-MR, SQL-GR)
      • OTHER DATABASES: Remote Data
      • LANGUAGES: SAS, Perl, Python, R, Ruby, etc.
      • HADOOP: Push-down to Hadoop
  • 41. Market Possibilities
      • The scale-out file system will not die (because it’s only an API)
      • YARN (& Cascading) will prosper
      • Hadoop will play a role in data flow
      • It will never replace the EDW, except by deception
      • The struggle for a unified architecture will continue
  • 42. Hadoop and the Data Warehouse: Competitive or Complementary? http://www.teradata.com/analyst-reports/Hadoop-and-the-Data-Warehouse-Competitive-or-Complementary/
  • 44. Upcoming Topics. This Month: BIG DATA. May: DATABASE. June: ANALYTICS & MACHINE LEARNING. www.insideanalysis.com/webcasts/the-briefing-room. 2014 Editorial Calendar at www.insideanalysis.com
  • 45. THANK YOU for your ATTENTION!