2. 2
Enterprise Data Warehouse (EDW)
• Used for reporting and data analysis.
• Data warehouse appliances have become the EDW trend.
• Before the EDW can be utilized, data must be loaded into tables using ETL.
• Data is accessed by applications using SQL.
High-level architecture of a conventional RDBMS
[Diagram: system tables (metadata / statistics), database tables (storage / tablespaces), and the SQL query engine (optimizer / query plan) bundled into one RDBMS; applications (BusinessObjects, Tableau, Cognos, other) connect through a database DRIVER. Example products: Exadata, Greenplum, Netezza, Redshift, Teradata, DB2.]
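Since applications reach the EDW only through SQL over a database driver, the access pattern can be sketched in a few lines. This is a minimal illustration using SQLite as a stand-in for a real EDW engine; the `sales` table and its columns are made up for the example.

```python
import sqlite3

# SQLite stands in for a real EDW engine (Teradata, Netezza, ...);
# the table and column names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("north", 100.0), ("south", 250.0), ("north", 50.0)])

# A BI tool (BusinessObjects, Tableau, ...) would issue aggregate
# queries like this through the database driver.
rows = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(rows)  # [('north', 150.0), ('south', 250.0)]
```

The point is that every tool in the RDBMS world speaks the same language: structured tables in, SQL queries out.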
3. 3
Enterprise Data Warehouse (EDW)
• Used for reporting and data analysis.
• Data warehouse appliances have become the EDW trend.
• Before the EDW can be utilized, data must be loaded into tables using ETL.
• Data is accessed by applications using SQL.
High-level architecture of a conventional RDBMS
[Diagram: the same RDBMS stack annotated into three layers: QUERY (SQL query engine: optimizer / query plan), METADATA (system tables: metadata / statistics), and STORAGE (database tables: storage / tablespaces).]
4. 4
What are EDW benefits?
• Long history guarantees there is no need to re-invent the wheel.
• There are plenty of knowledgeable resources available.
• SQL is a standard, so migrating from one platform to another is possible, although it requires some amount of resources.
• With a highly tuned database and structured data, you can get results extremely fast.
• There is basically an endless number of tools available for various scenarios.
5. 5
What are EDW constraints?
• ETL is expensive.
• Limited to predefined data types.
• Vendor lock-in.
• SQL: if all you have is a hammer, everything looks like a nail.
• Cost efficiency: ~$10,000/TB.
• Scalability: linear vs. non-linear
• Capacity: TB vs. PB
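The cost figures quoted in this deck (~$10,000/TB for an EDW appliance, ~$1,000/TB for Hadoop) make for a simple back-of-the-envelope comparison. The warehouse size below is a hypothetical example, not a figure from the deck.

```python
# Cost-per-TB figures from the slides; warehouse size is hypothetical.
edw_cost_per_tb = 10_000
hadoop_cost_per_tb = 1_000

data_tb = 500  # a hypothetical 0.5 PB warehouse
edw_total = data_tb * edw_cost_per_tb        # $5,000,000
hadoop_total = data_tb * hadoop_cost_per_tb  # $500,000
print(edw_total, hadoop_total)  # 5000000 500000
```

At roughly an order of magnitude per TB, the storage-cost gap grows with the data, which is exactly where the TB-vs-PB capacity constraint bites.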
6. 6
Hadoop
• Distributed open-source framework for storage and processing.
• The Hadoop core consists of a storage part (HDFS), a cluster resource management part
(YARN), and the MapReduce computing framework.
• YARN provides resource management not only for MapReduce, but for various other
computing frameworks, including Spark, Impala, and SOLR, among others.
• Applications connect to these higher-level computing frameworks.
[Diagram: high-level architecture of Hadoop: MapReduce, Impala, Spark, and SOLR running on YARN, on top of HDFS.]
7. 7
What are Hadoop benefits?
• Hadoop provides HA and linear scalability by default.
• There is not necessarily a need for ETL:
• You can copy the data to HDFS and immediately start
analyzing, querying, and processing it.
• Storage capacity: PB vs. TB.
• Cost efficiency: ~$1,000/TB
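"Copy the data and start analyzing immediately" works because the schema is applied at read time (schema on read) instead of at load time. A minimal sketch, assuming a made-up web-server-style log format: the raw lines are stored untouched, and the "schema" lives entirely in the reader.

```python
# Schema on read: raw lines are stored as-is and interpreted only
# when processed. The log format and field names are illustrative.
raw_lines = [
    "2015-03-01 GET /index.html 200",
    "2015-03-01 GET /missing 404",
    "2015-03-02 POST /login 200",
]

def parse(line):
    # The schema lives in the reader, not in the storage layer.
    date, method, path, status = line.split()
    return {"date": date, "method": method,
            "path": path, "status": int(status)}

errors = [r["path"] for r in map(parse, raw_lines) if r["status"] >= 400]
print(errors)  # ['/missing']
```

Contrast this with an EDW, where the same lines would first have to pass through an ETL job into predefined table columns before any query could run.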
8. 8
What are Hadoop benefits?
• On Hadoop, the query, metadata, and storage layers are separate:
• Supports SQL and non-SQL query/processing engines.
• The catalog can have different descriptions for the same data files.
• Users can access the same data files with different query engines.
• Users are not limited to SQL.
[Diagram: RDBMS vs. Hadoop, layer by layer: QUERY — SQL query engine (optimizer / query plan) vs. Hive, Impala, Spark, SOLR; METADATA — system tables (metadata / statistics) vs. HCatalog / Metastore; STORAGE — database tables (storage / tablespaces) vs. HDFS.]
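The slide's claim that the catalog can hold different descriptions for the same data files can be mimicked in a few lines. This is only an analogy, not HCatalog's API: two "users" overlay different schemas on the same raw records, the way different engines can read the same HDFS files through different table definitions.

```python
# Two readers interpreting the same raw records with different schemas,
# mimicking how a catalog can overlay descriptions on shared data files.
# The CSV-like records and field choices are made up for illustration.
records = ["1,alice,42", "2,bob,17"]

def schema_a(line):
    # User A only cares about id and name.
    id_, name, _ = line.split(",")
    return (int(id_), name)

def schema_b(line):
    # User B reads id and score, and types score as an integer.
    id_, _, score = line.split(",")
    return (int(id_), int(score))

view_a = [schema_a(r) for r in records]
view_b = [schema_b(r) for r in records]
print(view_a)  # [(1, 'alice'), (2, 'bob')]
print(view_b)  # [(1, 42), (2, 17)]
```

Because the files themselves stay untouched, neither user's view constrains the other, which is impossible when schema and storage are glued together as in an RDBMS.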
9. 9
Coexistence: Hadoop as part of EDW
● Offload part of the ETL workloads to Hadoop.
● Use Hadoop as low-cost storage: Active Archive.
● Utilize existing BI and ETL tools with Hadoop.
[Diagram: Hadoop alongside the EDW: ETL, EDW, SDO; data is ingested via Sqoop, Flume, and Kafka into HDFS; MapReduce, Impala, Spark, and SOLR run on YARN; BI tools connect to both.]
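The ETL workloads worth offloading are typically heavy aggregations over raw records. A toy MapReduce-style sketch of such a transform, in plain Python with made-up input records (on a real cluster this shape of job would run as MapReduce or Spark over HDFS):

```python
from collections import defaultdict

# A toy MapReduce-style aggregation, sketching the kind of transform
# that could be offloaded from the EDW to Hadoop. Records are made up.
events = [("2015-03-01", 3), ("2015-03-01", 5), ("2015-03-02", 2)]

def map_phase(record):
    # Emit (key, value) pairs; on a cluster this runs per partition.
    day, count = record
    yield (day, count)

def reduce_phase(pairs):
    # Sum values per key; on a cluster this runs per key group.
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

mapped = [kv for rec in events for kv in map_phase(rec)]
daily_totals = reduce_phase(mapped)
print(daily_totals)  # {'2015-03-01': 8, '2015-03-02': 2}
```

Only the small aggregated result then needs to be loaded into the EDW, freeing the appliance for the core reporting workloads.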
10. 10
Q&A
● References:
● Hadoop and the Data Warehouse: Hadoop 101 for EDW Professionals:
http://www.cloudera.com/content/dam/www/marketing/resources/webinars/building-a-hadoop-data-warehouse-video.png.l
● Using HCatalog: https://cwiki.apache.org/confluence/display/Hive/HCatalog+UsingHCat
● Configuring the Hive Metastore:
http://www.cloudera.com/documentation/archive/cdh/4-x/4-2-1/CDH4-Installation-Guide/cdh4ig_topic_18_4.html
Editor's Notes
EDW is used for reporting and analysis.
The data is collected from various sources:
Operational Data Store (ODS)
External sources.
Applications connect to EDW through database driver.
The query engine consists of an optimizer and a mechanism to query the actual tables.
The effectiveness of query engine depends on how accurate the statistics are.
The underlying storage layer determines how fast the I/O is when running the queries.
On DW appliances, the user does not have to spend that much time figuring out how to create the physical layout of the storage.
When we take a closer look at the different EDW/RDBMS layers, we note that there are basically three layers:
QUERY
METADATA
STORAGE
On a traditional EDW and RDBMS these are glued together, and you cannot replace any of them.
This is basically how all the RDBMS work: Exadata, Teradata, MS SQL, Netezza, DB2...
RDBMSs came in the late 70s.
There is a great heritage related to the relational model and RDBMS.
Computer Science studies cover the relational database model and the typical RDBMS model very early on.
It is relatively easy to find people with decent SQL and RDBMS skills.
Different software vendors offer various tools for various problems. Free and proprietary.
The New York Stock Exchange generates 4-5 TB of data every day.
A typical appliance can optimally load 5 TB per hour of highly transformed data.
We would also need to perform maintenance tasks: generate stats, reorg, maintain the indices, and so on.
Quite fast in certain situations, but processing 24 h of data might take 24 hours, which just doesn't work anymore.
The RDBMS model is basically a black box. You are limited to SQL, and you cannot change the SQL engine to any other SQL engine or non-SQL engine.
1TB costs ~ $10,000
Maximum capacity is typically in TBs, maximally 1-2 PB.
Todays requirements:
NYSE generates 5TB data per day
Facebook has more than 240 billion photos, growing at 7 PB per month.
Ancestry.com stores 10 PB of data.
The Internet Archive stores 18.5 PB.
The size of the digital universe was 4.4 zettabytes in 2013, and it is estimated to grow to 44 zettabytes by 2020.
A zettabyte is 10^21 bytes.
Other examples of growing mountain of data:
Machine logs
RFID readers
GPS traces
Retail transactions
“More data usually beats better algorithms”
Hadoop works well on unstructured data because it is designed to interpret data at processing time (schema on read).
This provides flexibility and avoids the need for ETL/ELT.
There is no need for data normalization to remove redundancy.
A web server log is a good example of non-normalized data.
MapReduce and other processing models scale linearly with the size of the data: data is partitioned, and functional primitives can work in parallel on separate partitions.
If you double the size of the data, it takes twice as long to process.
If you double the size of the cluster when the size of the data doubles, processing takes the same time as before.
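The linear-scaling rule of thumb above can be written out as time ∝ data size / cluster size. An idealized sketch that ignores overheads like shuffle and startup costs; the throughput figure is an assumption for illustration:

```python
# Idealized linear-scaling model: time = data / (nodes * throughput).
# The per-node throughput of 1 TB/hour is an illustrative assumption.
def processing_time(data_tb, nodes, tb_per_node_hour=1.0):
    return data_tb / (nodes * tb_per_node_hour)

t0 = processing_time(100, 10)   # baseline: 10.0 hours
t1 = processing_time(200, 10)   # double the data: 20.0 hours
t2 = processing_time(200, 20)   # double data AND cluster: 10.0 hours
print(t0, t1, t2)  # 10.0 20.0 10.0
```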
With Hadoop the QUERY layer is interchangeable.
For SQL-like queries you can change from Hive to Impala and vice versa.
You can also use non-SQL processing frameworks.
The METADATA layer, where the catalog is placed, can have multiple access points to the data files.
Two users can run different queries against the same data files and utilize different schemas.
Different BI tools, which we already know from the EDW world, can utilize the SQL engines on Hadoop.
We still have more tools than just SQL, so we have more than just a hammer.
In real life, a small fraction of workloads often consumes the majority of computing resources, e.g. ETL workloads take 60%-70% of computing resources.
You can probably fund the Hadoop project with the savings from offloading part of the ETL workloads to Hadoop.
The EDW can then be freed to run the core workloads, which were the reason to purchase the EDW in the first place. No need to expand the EDW budget.
Utilize Hadoop as low-cost storage and use it as an Active Archive.
During the project you get resources trained and Hadoop cluster established.