Dr. Ralph Kimball will describe how Apache Hadoop complements and integrates effectively with the existing enterprise data warehouse. The Hadoop environment's revolutionary architectural advantages open the door to more data, and more kinds of data, than it is possible to analyze with conventional RDBMSs, and they open up a whole series of new forms of integrated analysis.
Dr. Kimball will explain how Hadoop can be both:
- A destination data warehouse, and also
- An efficient staging and ETL source for an existing data warehouse
You will also learn how enterprise conformed dimensions can be used as the basis for integrating Hadoop and conventional data warehouses.
On-demand recording: http://www.cloudera.com/content/cloudera/en/resources/library/recordedwebinar/building-a-hadoop-data-warehouse-video.html
2. The Data Warehouse Mission
Identify all possible enterprise data assets
Select those assets that have actionable content and can be accessed
Bring the data assets into a logically centralized “enterprise data warehouse”
Expose those data assets most effectively for decision making
3. Enormous RDBMS Legacy
Legacy RDBMSs have been spectacularly successful, and we will continue to use them.
Too successful… if all you have is a hammer, everything looks like a nail.
RDBMS dilemma: an ocean of new data types that are being monetized for strategic advantage
Unstructured, semi-structured, and machine data
Evolving schemas, just-in-time schemas
Links, images, genomes, geo-positions, log data
…
4. Houston, we have a problem
Traditional RDBMSs cannot handle:
The new data types
Extended analytic processing
Terabytes/hour loading with immediate query access
We want to use SQL and SQL-like languages, but we don’t want the RDBMS storage constraints…
The disruptive solution: Hadoop
5. The Data Warehouse Stack in Hadoop
Hadoop is an open source distributed storage and processing framework.
To understand how data warehousing is different in Hadoop, start with this powerful architectural difference: [architecture diagram]
7. Hadoop for Exploratory DW/BI
• Query engines can access HDFS files before ETL
• BI tools are the ultimate glue integrating EDW
[Diagram: sources (transactions, free text, images, machines/sensors, links/networks) land in HDFS files: industry-standard hardware; fault tolerant; replicated; write once(!); content agnostic; scalable to “infinity”. Metadata lives in the HCatalog system table, which all clients can use to read the files. Hive SQL and Impala SQL (query engines, not databases!) are purpose built for EXTREME I/O speeds and serve BI tools such as Tableau, BusinessObjects, Cognos, QlikView, and others; EDW overflow moves via an ETL tool or Sqoop.]
8. Data Load to Query in One Step
Copy into HDFS with an ETL tool, Sqoop, or Flume
into standard HDFS files (write once)
registering metadata with HCatalog
Declare the query schema in Hive or Impala (no data copying or reloading)
Immediately launch familiar SQL queries: “Exploratory BI” (see the sketch below)
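A minimal HiveQL sketch of the declare-then-query steps, assuming click data has already landed in a hypothetical HDFS directory /data/raw/web_events (table and column names are illustrative):

    -- Declare a schema over files already sitting in HDFS;
    -- no data is copied or reloaded.
    CREATE EXTERNAL TABLE web_events (
      event_time STRING,
      user_id    BIGINT,
      url        STRING,
      referrer   STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    LOCATION '/data/raw/web_events';

    -- Immediately queryable: exploratory BI before any ETL has run.
    SELECT url, COUNT(*) AS hits
    FROM web_events
    GROUP BY url
    ORDER BY hits DESC
    LIMIT 25;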
9. Typical Large Hadoop Cluster
100 nodes (5 racks)
Each node:
Dual hex-core CPUs running at 3 GHz
64-378 GB of RAM
24-36 TB of disk storage (6-10 TB effective storage with the default redundancy of 3x)
Overall cluster (!):
6.4-37.8 TB of RAM (wow, think about this…)
Up to a PB of effective storage
Approximate fully loaded cost per TB: $1,000 +/-
10. Committing to High Performance HDFS Files with Embedded Schemas
[Diagram: the same stack, now with a columnar layer. Sources (transactions, free text, images, machines/sensors, links/networks) land in raw HDFS files: commodity hardware; fault tolerant; replicated; append only(!); content agnostic; scalable to “infinity”. Raw data is copied into Parquet columnar FILES: a read-optimized, schema-defined column store, purpose built for EXTREME I/O speeds. Metadata lives in the HCatalog system table, which all clients can use to read the files. Hive SQL and Impala SQL (query engines, not databases!) serve BI tools such as Tableau, BusinessObjects, Cognos, QlikView, and others; EDW overflow moves via an ETL tool or Sqoop.]
11. High Performance Data Warehouse Thread in Hadoop
Copy data from the raw HDFS file into a Parquet columnar file (see the sketch below)
Parquet is not a database: it’s a file accessible to multiple query and analysis apps
Parquet data can be updated and the schema modified
Query Parquet data with Hive or Impala
At least a 10x performance gain over the simple raw file
Hive launches MapReduce jobs: relation scans
Ideal for ETL and transfer to a conventional EDW
Impala launches individual queries in memory
Ideal for interactive query in a Hadoop destination DW
Impala: at least a 10x additional performance gain over Hive
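A minimal HiveQL sketch of the raw-to-Parquet step, reusing the hypothetical web_events table declared above:

    -- Copy the raw data into a read-optimized Parquet columnar table.
    -- (CREATE TABLE ... STORED AS PARQUET needs Hive 0.13 or later.)
    CREATE TABLE web_events_parquet
    STORED AS PARQUET
    AS SELECT * FROM web_events;

    -- The same SQL now runs against the Parquet table: in Hive for
    -- MapReduce-style relation scans (ETL), or in Impala for
    -- in-memory interactive queries.
    SELECT user_id, COUNT(*) AS page_views
    FROM web_events_parquet
    GROUP BY user_id;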
12. Use Hadoop as a Platform for Direct Analysis or ETL to a Text/Number DB
Huge array of special analysis apps for:
Unstructured text
Hyper-structured text/numbers (machine data)
Positional data from GPS
Images
Audio, video
Consume results through the increasing SQL support of these individual apps
Or, write text/number data into Hadoop from an unstructured source or an external EDW relational DBMS
13. The Larger Picture: Why Use Hadoop as Part of Your EDW?
Strategic:
Open the floodgates to new kinds of data
New kinds of analysis impossible in an RDBMS
“Schema on read” for exploratory BI
Attack the same data from multiple perspectives
Choose SQL and non-SQL approaches at query time
Keep hyper-granular data in an “active archive” forever
No-compromise data analysis
Compliance
Simultaneous incompatible analysis modes on the same data files
Enterprise data hub: one location for all data resources
Tactical:
Dramatically lowered operational costs
Linear scaling across response time, concurrency, and data size, well beyond petabytes
Highly reliable write-once, redundantly stored data
Meet ETL SLAs
14. It’s Not That Difficult
Important existing tools already work in Hadoop:
ETL tool suites: familiar data flows and user interfaces
BI query tools: identical user interfaces and integration
Standard job schedulers and sort packages (e.g., SyncSort)
Skills you need anyway:
Java, Python or Ruby, C, SQL, Sqoop data transfer
Linux admin
But MapReduce programming is no longer needed
Investigate and add incrementally:
Analytic tools: MADlib extensions to the RDBMS, SAS, R
Specialty data tools, e.g., Splunk (machine data)
15. Integration is Crucial
Integration is MORE than bringing separate data sources onto a common platform.
Suppose you have two customer-facing data sources in your DW producing the following results. [results table]
Is this integration?
16. Doing Integration the Right Way
A teaspoon sip of EDW 101 for Hadoop professionals!
Build a conformed dimension library
Plan to download dimensions from the EDW
Attach conformed dimensions to every possible source
Join dimensions at query time to fact tables in SQL-capable files (see the sketch below)
Embed dimension content as columns in NoSQL structures, including HBase
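A minimal HiveQL sketch of a query-time join to a conformed dimension, with hypothetical table names (dim_customer stands in for a dimension downloaded from the EDW):

    -- Join the Hadoop-resident fact table to the conformed customer
    -- dimension at query time; customer_category is a conformed attribute.
    SELECT d.customer_category,
           SUM(f.revenue) AS revenue
    FROM   sales_fact f
    JOIN   dim_customer d
      ON   f.customer_key = d.customer_key
    GROUP BY d.customer_category;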
17. Integrating Big Data
Remember: data warehouse integration is drilling across:
Establish conformed attributes (e.g., Customer Category) in each database
Fetch separate answer sets from the different platforms, grouped on the same conformed attributes
Sort-merge the answer sets at the BI layer
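A sketch of drilling across two platforms, with hypothetical table names; each system returns its own answer set grouped on the conformed Customer Category, and the BI layer sort-merges the rows:

    -- Answer set 1, from the conventional EDW: revenue by category.
    SELECT d.customer_category, SUM(f.revenue) AS revenue
    FROM   edw_sales_fact f
    JOIN   dim_customer d ON f.customer_key = d.customer_key
    GROUP BY d.customer_category;

    -- Answer set 2, from Hadoop: page views by the same conformed category.
    SELECT d.customer_category, COUNT(*) AS page_views
    FROM   web_events_fact w
    JOIN   dim_customer d ON w.customer_key = d.customer_key
    GROUP BY d.customer_category;

    -- The BI tool sort-merges the two answer sets on customer_category,
    -- producing one row per category with revenue and page_views together.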
18. Out of the Box Possibility: Billions of Rows, Millions of Columns
A tough problem for all current relational platforms: huge name-value data sources (e.g., customer observations)
Think about HBase (!)
Intended for “impossibly wide schemas”
Fully general binary data content
Fire-hose SCD1 and SCD2 updates of individual records
Continuously growing rows and columns
Only simple SQL direct access is possible now: no joins…
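One concrete way to reach such name-value data from SQL is Hive's HBase storage handler, which can surface an entire HBase column family as a single map column; a sketch with hypothetical names (and, as noted above, only simple direct access, no joins):

    -- Expose an existing HBase table of customer observations to Hive/SQL.
    -- The obs column family holds an open-ended set of name-value
    -- pairs, mapped here onto one MAP column.
    CREATE EXTERNAL TABLE customer_observations (
      customer_id STRING,
      obs         MAP<STRING, STRING>
    )
    STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
    WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,obs:')
    TBLPROPERTIES ('hbase.table.name' = 'customer_observations');

    -- Simple direct access: pull one named observation per customer.
    SELECT customer_id, obs['geo_position']
    FROM customer_observations;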
19. Summing Up: The Data Warehouse Renaissance
The Hadoop DW becomes an equal partner with the enterprise DW
Hadoop will be the strategic environment of choice for new data types and new analysis modes
Hadoop:
Extreme data type diversity
Huge library of specialty analysis tools with SQL extensions
Starting point for exploratory BI and ETL-to-EDW processing
Destination point for serious BI
Permanent active archive of hyper-granular data
BI tools implement Hadoop-to-EDW integration
20. The Kimball Group Resource
www.kimballgroup.com
Best-selling data warehouse books
NEW BOOK! The classic “Toolkit,” 3rd Ed.
In-depth data warehouse classes taught by the primary authors:
Dimensional modeling (Ralph/Margy)
ETL architecture (Ralph/Bob)
Dimensional design reviews and consulting by Kimball Group principals
White papers on Integration, Data Quality, and Big Data Analytics
21. “A data warehouse DBMS is now expected to coordinate data virtualization strategies, and distributed file and/or processing approaches, to address changes in data management and access requirements.”
– 2014 Gartner Magic Quadrant for Data Warehouse DBMS
Data Warehousing, Meet Hadoop
22. What inhibits “Big Data” initiatives?
• No compelling business need
• Not enough staff to support
• Lack of “data science” expertise
• Missing enterprise-grade features
• Complexity of DIY open source
24. From Apache Hadoop to an enterprise data hub
[Diagram: core Hadoop. MAPREDUCE provides batch processing over the HDFS filesystem: storage for any type of data; unified, elastic, resilient, secure.
Scorecard: Open Source / Scalable / Flexible / Cost-Effective ✔; Managed ✖; Open Architecture ✖; Secure and Governed ✖]
25. From Apache Hadoop to an enterprise data hub
[Diagram: CLOUDERA MANAGER is added for system management alongside MapReduce and HDFS.
Scorecard: Open Source / Scalable / Flexible / Cost-Effective ✔; Managed ✔; Open Architecture ✖; Secure and Governed ✖]
26. From Apache Hadoop to an enterprise data hub
[Diagram: YARN workload management and specialized engines are added over the HDFS filesystem and HBASE (online NoSQL): MAPREDUCE (batch processing), IMPALA (analytic SQL), SOLR (search engine), SPARK (machine learning), SPARK STREAMING (stream processing), and 3rd-party apps.
Scorecard: Open Source / Scalable / Flexible / Cost-Effective ✔; Managed ✔; Open Architecture ✔; Secure and Governed ✖]
27. From Apache Hadoop to an enterprise data hub
[Diagram: CLOUDERA NAVIGATOR (data management) and SENTRY (security) complete the stack.
Scorecard: Open Source / Scalable / Flexible / Cost-Effective ✔; Managed ✔; Open Architecture ✔; Secure and Governed ✔]
30. Disrupt the Industry, Not Your Business
[Diagram: Your Journey to Gaining Value from All Your Data. An IT-to-Business progression from operational efficiency (faster, bigger, cheaper): cheap storage, ETL acceleration, EDW optimization; to transformative applications (new business value): agile exploration, data science, Customer 360.]
31. Thank you for attending!
• Submit questions in the Q&A panel
• For a comprehensive set of data warehouse resources (books, in-depth classes, overall design consulting): http://www.kimballgroup.com
• Follow:
• @cloudera
• @mattbrandwein
Register now for our next webinar with Dr. Ralph Kimball:
Best Practices for the Hadoop Data Warehouse: EDW 101 for Hadoop Professionals
Online Webinar | May 29, 2014 | 10AM PT / 1PM ET
http://tinyurl.com/kimballwebinar