Dr. Ralph Kimball will describe how Apache Hadoop complements and integrates effectively with the existing enterprise data warehouse. The Hadoop environment's revolutionary architectural advantages open the door to more data, and more kinds of data, than it is possible to analyze with conventional RDBMSs, and they open up a whole series of new forms of integrated analysis.
Dr. Kimball will explain how Hadoop can be both:
- A destination data warehouse, and also
- An efficient staging and ETL source for an existing data warehouse
You will also learn how enterprise conformed dimensions can be used as the basis for integrating Hadoop and conventional data warehouses.
On-demand recording: http://www.cloudera.com/content/cloudera/en/resources/library/recordedwebinar/building-a-hadoop-data-warehouse-video.html
2. The Data Warehouse Mission
Identify all possible enterprise data assets
Select those assets that have actionable content and can be accessed
Bring the data assets into a logically centralized “enterprise data warehouse”
Expose those data assets most effectively for decision making
3. Enormous RDBMS Legacy
Legacy RDBMSs have been spectacularly successful, and we will continue to use them.
Too successful… if all you have is a hammer, everything looks like a nail.
RDBMS dilemma: an ocean of new data types that are being monetized for strategic advantage
Unstructured, semi-structured, and machine data
Evolving schemas, just-in-time schemas
Links, images, genomes, geo-positions, log data
…
4. Houston, we have a problem
Traditional RDBMSs cannot handle:
The new data types
Extended analytic processing
Terabytes/hour loading with immediate query access
We want to use SQL and SQL-like languages, but we don’t want the RDBMS storage constraints…
The disruptive solution: Hadoop
5. The Data Warehouse Stack in Hadoop
Hadoop is an open source distributed storage and processing framework.
To understand how data warehousing is different in Hadoop, start with this powerful architectural difference: [architecture diagram]
7. Hadoop for Exploratory DW/BI
• Query engines can access HDFS files before ETL
• BI tools are the ultimate glue integrating EDW
[Diagram: sources (transactions, free text, images, machines/sensors, links/networks) land in HDFS files: industry-standard hardware; fault tolerant; replicated; write once(!); content agnostic; scalable to “infinity”. Metadata lives in the HCatalog system table, which all clients can use to read the files. Hive SQL and Impala SQL (query engines, not databases!) are purpose built for EXTREME I/O speeds and serve BI tools such as Tableau, BusinessObjects, Cognos, QlikView, and others; EDW overflow moves via an ETL tool or Sqoop.]
8. Data Load to Query in One Step
Copy into HDFS with an ETL tool, Sqoop, or Flume
into standard HDFS files (write once)
registering metadata with HCatalog
Declare the query schema in Hive or Impala (no data copying or reloading)
Immediately launch familiar SQL queries: “Exploratory BI” (see the sketch below)
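A minimal HiveQL sketch of the declare-then-query steps, assuming click data has already landed in a hypothetical HDFS directory /data/raw/web_events (table and column names are illustrative):

    -- Declare a schema over files already sitting in HDFS;
    -- no data is copied or reloaded.
    CREATE EXTERNAL TABLE web_events (
      event_time STRING,
      user_id    BIGINT,
      url        STRING,
      referrer   STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    LOCATION '/data/raw/web_events';

    -- Immediately queryable: exploratory BI before any ETL has run.
    SELECT url, COUNT(*) AS hits
    FROM web_events
    GROUP BY url
    ORDER BY hits DESC
    LIMIT 25;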
9. Typical Large Hadoop Cluster
100 nodes (5 racks)
Each node:
Dual hex-core CPUs running at 3 GHz
64-378 GB of RAM
24-36 TB of disk storage (6-10 TB effective storage with the default redundancy of 3x)
Overall cluster (!):
6.4-37.8 TB of RAM (wow, think about this…)
Up to a PB of effective storage
Approximate fully loaded cost per TB: $1,000 +/-
10. Committing to High Performance HDFS Files with Embedded Schemas
[Diagram: the same stack, now with a columnar layer. Sources (transactions, free text, images, machines/sensors, links/networks) land in raw HDFS files: commodity hardware; fault tolerant; replicated; append only(!); content agnostic; scalable to “infinity”. Raw data is copied into Parquet columnar FILES: a read-optimized, schema-defined column store, purpose built for EXTREME I/O speeds. Metadata lives in the HCatalog system table, which all clients can use to read the files. Hive SQL and Impala SQL (query engines, not databases!) serve BI tools such as Tableau, BusinessObjects, Cognos, QlikView, and others; EDW overflow moves via an ETL tool or Sqoop.]
11. High Performance Data Warehouse Thread in Hadoop
Copy data from the raw HDFS file into a Parquet columnar file (see the sketch below)
Parquet is not a database: it’s a file accessible to multiple query and analysis apps
Parquet data can be updated and the schema modified
Query Parquet data with Hive or Impala
At least a 10x performance gain over the simple raw file
Hive launches MapReduce jobs: relation scans
Ideal for ETL and transfer to a conventional EDW
Impala launches individual queries in memory
Ideal for interactive query in a Hadoop destination DW
Impala: at least a 10x additional performance gain over Hive
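A minimal HiveQL sketch of the raw-to-Parquet step, reusing the hypothetical web_events table declared above:

    -- Copy the raw data into a read-optimized Parquet columnar table.
    -- (CREATE TABLE ... STORED AS PARQUET needs Hive 0.13 or later.)
    CREATE TABLE web_events_parquet
    STORED AS PARQUET
    AS SELECT * FROM web_events;

    -- The same SQL now runs against the Parquet table: in Hive for
    -- MapReduce-style relation scans (ETL), or in Impala for
    -- in-memory interactive queries.
    SELECT user_id, COUNT(*) AS page_views
    FROM web_events_parquet
    GROUP BY user_id;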
12. Use Hadoop as a Platform for Direct Analysis or ETL to a Text/Number DB
Huge array of special analysis apps for:
Unstructured text
Hyper-structured text/numbers (machine data)
Positional data from GPS
Images
Audio, video
Consume results through the increasing SQL support of these individual apps
Or, write text/number data into Hadoop from an unstructured source or an external EDW relational DBMS
13. The Larger Picture: Why Use Hadoop as Part of Your EDW?
Strategic:
Open the floodgates to new kinds of data
New kinds of analysis impossible in an RDBMS
“Schema on read” for exploratory BI
Attack the same data from multiple perspectives
Choose SQL and non-SQL approaches at query time
Keep hyper-granular data in an “active archive” forever
No-compromise data analysis
Compliance
Simultaneous incompatible analysis modes on the same data files
Enterprise data hub: one location for all data resources
Tactical:
Dramatically lowered operational costs
Linear scaling across response time, concurrency, and data size, well beyond petabytes
Highly reliable write-once, redundantly stored data
Meet ETL SLAs
14. It’s Not That Difficult
Important existing tools already work in Hadoop:
ETL tool suites: familiar data flows and user interfaces
BI query tools: identical user interfaces and integration
Standard job schedulers and sort packages (e.g., SyncSort)
Skills you need anyway:
Java, Python or Ruby, C, SQL, Sqoop data transfer
Linux admin
But MapReduce programming is no longer needed
Investigate and add incrementally:
Analytic tools: MADlib extensions to the RDBMS, SAS, R
Specialty data tools, e.g., Splunk (machine data)
15. Integration is Crucial
Integration is MORE than bringing separate data sources onto a common platform.
Suppose you have two customer-facing data sources in your DW producing the following results. [results table]
Is this integration?
16. Doing Integration the Right Way
A teaspoon sip of EDW 101 for Hadoop professionals!
Build a conformed dimension library
Plan to download dimensions from the EDW
Attach conformed dimensions to every possible source
Join dimensions at query time to fact tables in SQL-capable files (see the sketch below)
Embed dimension content as columns in NoSQL structures, including HBase
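A minimal HiveQL sketch of a query-time join to a conformed dimension, with hypothetical table names (dim_customer stands in for a dimension downloaded from the EDW):

    -- Join the Hadoop-resident fact table to the conformed customer
    -- dimension at query time; customer_category is a conformed attribute.
    SELECT d.customer_category,
           SUM(f.revenue) AS revenue
    FROM   sales_fact f
    JOIN   dim_customer d
      ON   f.customer_key = d.customer_key
    GROUP BY d.customer_category;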
17. Integrating Big Data
Remember: data warehouse integration is drilling across:
Establish conformed attributes (e.g., Customer Category) in each database
Fetch separate answer sets from the different platforms, grouped on the same conformed attributes
Sort-merge the answer sets at the BI layer
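A sketch of drilling across two platforms, with hypothetical table names; each system returns its own answer set grouped on the conformed Customer Category, and the BI layer sort-merges the rows:

    -- Answer set 1, from the conventional EDW: revenue by category.
    SELECT d.customer_category, SUM(f.revenue) AS revenue
    FROM   edw_sales_fact f
    JOIN   dim_customer d ON f.customer_key = d.customer_key
    GROUP BY d.customer_category;

    -- Answer set 2, from Hadoop: page views by the same conformed category.
    SELECT d.customer_category, COUNT(*) AS page_views
    FROM   web_events_fact w
    JOIN   dim_customer d ON w.customer_key = d.customer_key
    GROUP BY d.customer_category;

    -- The BI tool sort-merges the two answer sets on customer_category,
    -- producing one row per category with revenue and page_views together.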
18. Out of the Box Possibility: Billions of Rows, Millions of Columns
A tough problem for all current relational platforms: huge name-value data sources (e.g., customer observations)
Think about HBase (!)
Intended for “impossibly wide schemas”
Fully general binary data content
Fire-hose SCD1 and SCD2 updates of individual records
Continuously growing rows and columns
Only simple SQL direct access is possible now: no joins…
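One concrete way to reach such name-value data from SQL is Hive's HBase storage handler, which can surface an entire HBase column family as a single map column; a sketch with hypothetical names (and, as noted above, only simple direct access, no joins):

    -- Expose an existing HBase table of customer observations to Hive/SQL.
    -- The obs column family holds an open-ended set of name-value
    -- pairs, mapped here onto one MAP column.
    CREATE EXTERNAL TABLE customer_observations (
      customer_id STRING,
      obs         MAP<STRING, STRING>
    )
    STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
    WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,obs:')
    TBLPROPERTIES ('hbase.table.name' = 'customer_observations');

    -- Simple direct access: pull one named observation per customer.
    SELECT customer_id, obs['geo_position']
    FROM customer_observations;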
19. Summing Up: The Data Warehouse Renaissance
The Hadoop DW becomes an equal partner with the enterprise DW
Hadoop will be the strategic environment of choice for new data types and new analysis modes
Hadoop:
Extreme data type diversity
Huge library of specialty analysis tools with SQL extensions
Starting point for exploratory BI and ETL-to-EDW processing
Destination point for serious BI
Permanent active archive of hyper-granular data
BI tools implement Hadoop-to-EDW integration
20. The Kimball Group Resource
www.kimballgroup.com
Best-selling data warehouse books
NEW BOOK! The classic “Toolkit,” 3rd Ed.
In-depth data warehouse classes taught by the primary authors:
Dimensional modeling (Ralph/Margy)
ETL architecture (Ralph/Bob)
Dimensional design reviews and consulting by Kimball Group principals
White papers on Integration, Data Quality, and Big Data Analytics
21. “A data warehouse DBMS is now expected to coordinate data virtualization strategies, and distributed file and/or processing approaches, to address changes in data management and access requirements.”
– 2014 Gartner Magic Quadrant for Data Warehouse DBMS
Data Warehousing, Meet Hadoop
22. What inhibits “Big Data” initiatives?
• No compelling business need
• Not enough staff to support
• Lack of “data science” expertise
• Missing enterprise-grade features
• Complexity of DIY open source
24. From Apache Hadoop to an enterprise data hub
[Diagram: core Hadoop. MAPREDUCE provides batch processing over the HDFS filesystem: storage for any type of data; unified, elastic, resilient, secure.
Scorecard: Open Source / Scalable / Flexible / Cost-Effective ✔; Managed ✖; Open Architecture ✖; Secure and Governed ✖]
25. From Apache Hadoop to an enterprise data hub
[Diagram: CLOUDERA MANAGER is added for system management alongside MapReduce and HDFS.
Scorecard: Open Source / Scalable / Flexible / Cost-Effective ✔; Managed ✔; Open Architecture ✖; Secure and Governed ✖]
26. From Apache Hadoop to an enterprise data hub
[Diagram: YARN workload management and specialized engines are added over the HDFS filesystem and HBASE (online NoSQL): MAPREDUCE (batch processing), IMPALA (analytic SQL), SOLR (search engine), SPARK (machine learning), SPARK STREAMING (stream processing), and 3rd-party apps.
Scorecard: Open Source / Scalable / Flexible / Cost-Effective ✔; Managed ✔; Open Architecture ✔; Secure and Governed ✖]
27. From Apache Hadoop to an enterprise data hub
[Diagram: CLOUDERA NAVIGATOR (data management) and SENTRY (security) complete the stack.
Scorecard: Open Source / Scalable / Flexible / Cost-Effective ✔; Managed ✔; Open Architecture ✔; Secure and Governed ✔]
30. Disrupt the Industry, Not Your Business
[Diagram: Your Journey to Gaining Value from All Your Data. An IT-to-Business progression from operational efficiency (faster, bigger, cheaper): cheap storage, ETL acceleration, EDW optimization; to transformative applications (new business value): agile exploration, data science, Customer 360.]
31. Thank you for attending!
• Submit questions in the Q&A panel
• For a comprehensive set of data warehouse resources (books, in-depth classes, overall design consulting): http://www.kimballgroup.com
• Follow:
• @cloudera
• @mattbrandwein
Register now for our next webinar with Dr. Ralph Kimball:
Best Practices for the Hadoop Data Warehouse: EDW 101 for Hadoop Professionals
Online Webinar | May 29, 2014 | 10AM PT / 1PM ET
http://tinyurl.com/kimballwebinar