Hadoop: Extending Your Data Warehouse
Tony Baer | Principal Analyst, Ovum
Moderated by Matt Brandwein | Product Marketing Manager, Cloudera
May 9, 2013

Welcome to the webinar!
• All lines are muted
• Q&A after the presentation
• Ask questions at any time by typing them in the “Questions” pane on your WebEx panel
• Recording of this webinar will be available on-demand at cloudera.com
• Join the conversation on Twitter: @cloudera @TonyBaer #EDWHadoop

Who is Cloudera?
What the Enterprise Requires
• Only 100% open source Hadoop-based platform with both batch and real-time processing engines, enterprise-ready with native high availability
• Suite of system and data management software
• Comprehensive support and consulting services
• Broadest Hadoop training and certification programs
Extensive Partner Ecosystem
• Over 600 partners across hardware, software and services
The Leader in Big Data Management
• Deliver a revolutionary data management platform powered by Apache Hadoop
• World’s leading commercial vendor of Apache Hadoop
• Enable organizations to improve operational efficiency and Ask Bigger Questions of all their data
Customers & Users Across Industries
• More production deployments than all other vendors combined

© Copyright Ovum. All rights reserved. Ovum is a subsidiary of Informa plc.
Hadoop: Extending your Data Warehouse
Tony Baer
tony.baer@ovum.com
May 9, 2013
Twitter: @TonyBaer

Agenda
• The BI Bottleneck
• Hadoop & Enterprise Data Warehousing strategy
• How Cloudera supports Hadoop as extended DW

Traditional BI/Data warehousing architecture
[Diagram: Sources → Extract → Staging Server (ETL Tool: Transform) → Load → Target(s): DW, Data Marts]

DW: The base case
• DWs conceived for MBytes/GBytes of structured data
• Data structured based on expected queries & analytics
• Multiple tiers to separate distinct workloads
  • OLTP: ongoing, shallow interactions, simple queries
  • Transform: batch-oriented, IOPS-intensive
  • BI/analytics: data-intensive, spiky
• Reduced or eliminated impact on OLTP
• More complex architecture, more tradeoffs

EDW hitting the wall
• Data growing in volume & complexity
• Use cases require more, richer data
  • Customer retention
  • Operational efficiency
  • Risk mitigation
• Data retention mandates/policies forcing hard decisions
• ETL bursting batch windows
• EDWs straining to accommodate volumes, varieties of data

The ELT pattern
[Diagram: Sources → Extract → Load/Transform inside Target(s): DW, Data Marts]
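
The difference between ETL and ELT can be sketched in a few lines. This is a minimal illustration, not the architecture from the slides: an in-memory SQLite database stands in for the DW, and the table names (`raw_orders`, `orders`) and sample rows are hypothetical. In ETL the rows are cleaned before they reach the target; in ELT the raw rows are loaded first and the target's own SQL engine does the transformation.

```python
import sqlite3

# Hypothetical source rows: (order_id, amount as text, country code)
SOURCE_ROWS = [(1, " 19.50", "us"), (2, "5.00 ", "de"), (3, "12.25", "US")]


def etl(conn):
    """ETL: transform outside the target, then load the cleaned rows."""
    cleaned = [(oid, float(amt.strip()), cc.upper()) for oid, amt, cc in SOURCE_ROWS]
    conn.execute("CREATE TABLE orders (order_id INTEGER, amount REAL, country TEXT)")
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", cleaned)


def elt(conn):
    """ELT: load the raw rows as-is, then transform inside the target with SQL."""
    conn.execute("CREATE TABLE raw_orders (order_id INTEGER, amount TEXT, country TEXT)")
    conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", SOURCE_ROWS)
    conn.execute(
        """
        CREATE TABLE orders AS
        SELECT order_id,
               CAST(TRIM(amount) AS REAL) AS amount,
               UPPER(country) AS country
        FROM raw_orders
        """
    )


def run(pattern):
    """Run one pattern against a fresh in-memory database and return the result rows."""
    conn = sqlite3.connect(":memory:")
    pattern(conn)
    rows = conn.execute(
        "SELECT order_id, amount, country FROM orders ORDER BY order_id"
    ).fetchall()
    conn.close()
    return rows


etl_rows = run(etl)
elt_rows = run(elt)
print(etl_rows == elt_rows)  # both patterns yield the same cleaned table
```

Both paths end with the same `orders` table. What changes is where the transformation cycles are spent: on a separate staging tier (ETL) or on the target platform itself (ELT), which is the tradeoff the next slide weighs.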

The benefits – and limits – of ELT
• Pros
  • Fewer data movements
  • Flatter architecture
  • Reduced errors with fewer data movements
• Cons
  • Transform vs. analytic workload tradeoffs
  • SLAs jeopardized
  • Triggers arms race for more infrastructure
[Chart: Data Volumes, Processing Times, Infrastructure Costs; assuming constant SLAs]

Enterprise DWs: Size has its limits
• SLAs hit the wall
• Software licensing costs
• PBytes @ $20k - $50k/TByte get $$$$$$
• Managing/transforming new data types consumes resources
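
To make the cost line concrete, here is the back-of-the-envelope arithmetic for one petabyte at the quoted $20k to $50k per TByte. The sketch assumes decimal units (1 PByte = 1,000 TBytes) and treats the quoted figure as an all-in cost per TByte; both are assumptions, not statements from the slides.

```python
TB_PER_PB = 1_000  # assumption: decimal units

COST_PER_TB_LOW = 20_000   # USD, low end quoted on the slide
COST_PER_TB_HIGH = 50_000  # USD, high end quoted on the slide


def edw_cost_range(petabytes):
    """Return (low, high) cost in USD for the given volume in PBytes."""
    terabytes = petabytes * TB_PER_PB
    return terabytes * COST_PER_TB_LOW, terabytes * COST_PER_TB_HIGH


low, high = edw_cost_range(1)
print(f"1 PByte: ${low:,} to ${high:,}")  # 1 PByte: $20,000,000 to $50,000,000
```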

But what if...
• You don’t have to worry about batch windows
• You don’t have to trade off transformation vs. analytic processing cycles
• You can control s/w license cost escalation
• You can keep that archived data live
• You can more readily consume new types of data & keep your analytic options open

Agenda

Introducing Hadoop
• Originally, a data processing framework for solving unique Internet-scale problems
• Based on Google File System (GFS) & MapReduce
• Apache Hadoop community emerged to develop platform for wider-scale adoption
• Financial services (FS), telcos, retail, media discovered Hadoop’s benefits
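
For readers new to the model, the MapReduce idea behind Hadoop can be sketched on a single machine. This is a toy illustration in plain Python, not Hadoop's API: the function names (`map_phase`, `shuffle`, `reduce_phase`) and the sample log lines are hypothetical, and a real cluster would run the map and reduce steps in parallel across many nodes.

```python
from collections import defaultdict

# Hypothetical raw input: web server log lines ending in an HTTP status code
LOG_LINES = [
    "GET /home 200",
    "GET /cart 500",
    "POST /cart 200",
    "GET /home 200",
]


def map_phase(line):
    """Emit (key, 1) pairs: here, one pair per HTTP status code."""
    status = line.split()[-1]
    yield status, 1


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Collapse all values for one key into a single count."""
    return key, sum(values)


def run_job(lines):
    mapped = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


status_counts = run_job(LOG_LINES)
print(status_counts)  # {'200': 3, '500': 1}
```

Because each map call looks at one record and each reduce call looks at one key, the work splits cleanly across machines, which is what lets the approach scale out on commodity hardware.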

Hadoop benefits
• Scalability: Near linear performance up to 1000s of nodes
• Cost: Leverages commodity h/w & open source s/w
• Flexibility: Versatility with data, analytics & operation

Hadoop’s trump card: Flexibility
• Accommodates all kinds of data
• Accommodates multiple workloads
• Keeps your options open
• Extensibility
  • Life beyond MapReduce
  • Many personalities
• Best of both worlds
  • Convergence with SQL
Get the best of both worlds

Hadoop as Data transformation platform
[Diagram: Sources → Extract → Hadoop (Load/Transform) → Target: existing DW/Data Mart environment (DW, Data Marts)]

Why Hadoop as your data transformation platform?
• Inexpensive cycles/storage
  • Low-cost platform reduces or eliminates tradeoff contingencies
  • No more transformation vs. analytics choice
  • Keep your archive active
• Flexible division of labor
  • Data can remain in Hadoop or be moved to SQL
  • Raw data sits alongside transformed data

Why Hadoop as extension to your DW?
• Efficient division of labor
  • Run time-consuming, resource-intensive analytic workloads inside Hadoop
  • Routine query, analytics, & reporting in SQL DW or data mart
• Query Hadoop directly
  • Most commercial BI tools read Hive metadata
• Query Hadoop interactively
  • Emerging MapReduce alternatives supporting interactive query

Agenda

Cloudera supports SQL convergence
• Partners with leading ETL, BI, and data warehousing platform & tool providers
  • Connect Hadoop & SQL platforms
  • Emerging trend: BI, ETL tools are working natively inside Hadoop
• Introducing Impala
  • Brings high-performance interactive SQL inside Hadoop
  • Turns Hadoop into an MPP SQL analytic data target
  • Extends, doesn't replace your SQL EDW or data mart
  • Makes your DW strategy more flexible, iterative

Taming Hadoop
• Cloudera Manager
  • Automates deployment and health monitoring
  • Automates Hadoop configuration
  • New side-by-side deployment support
• Cloudera Navigator
  • New feature of Cloudera Manager
  • Tracks data utilization activity from HDFS, Hive & HBase
  • Stepping stone for data security/stewardship… watch this space
• Backup & Disaster Recovery (BDR)
  • New feature to automate recovery workflows

Hadoop: Takeaways
• Economical platform for offloading data transformation cycles
• Extends enterprise analytics
• Hadoop & SQL are converging, broadening your analytic options
• Hadoop won’t replace your EDW, but will take more of the workload
• Cloudera is actively broadening CDH to support & extend your EDW
  • SQL convergence
  • Platform manageability
  • Data security & stewardship

Impala: Cloudera’s Design Strategy
An Integrated Part of the Hadoop Platform
[Diagram: Hadoop platform stack
  Engines: Batch Processing (MapReduce, Hive & Pig), Interactive SQL (Impala), Math / Machine Learning / Analytics, …
  Shared layers: Storage, Integration, Resource Management, Metadata
  Storage: HDFS (Text, RCFile, Parquet, Avro, etc.), HBase (records)]
• Complement MapReduce with interactive MPP SQL engine
• One pool of data
• One metadata model
• One security framework
• One set of system resources
• 100% open source

Impala Use Cases
Cost-effective, ad hoc query environment that offloads the data warehouse for:
• Interactive BI/analytics on more data
• Asking new questions
• Data processing with tight SLAs
• Query-able archive w/ full fidelity

Leading BI tools work with Impala

Questions?
• Type in the “Questions” panel
• Tweet @cloudera #EDWHadoop
• Recording will be available on-demand at cloudera.com
• Contact us:
  tony.baer@ovum.com, Twitter: @TonyBaer
  mbrandwein@cloudera.com, Twitter: @MattBrandwein
Thank you for attending!
Try Cloudera today: cloudera.com/downloads
Learn more about Impala: cloudera.com/impala
Get Hadoop Training: university.cloudera.com
Ready to go? Check out Cloudera Quickstart: cloudera.com/quickstart
