2. 2
Enterprise Data Warehouse (EDW)
• Used for reporting and data analysis.
• Data warehouse appliances have become the EDW trend.
• Before the EDW can be utilized, data must be loaded into tables using ETL.
• Data is accessed by applications using SQL.
High-level architecture of a conventional RDBMS
[Diagram: system tables (metadata / statistics), database tables (storage / tablespaces), and the SQL query engine (optimizer / query plan) bundled into one RDBMS; applications (BusinessObjects, Tableau, Cognos, other) connect through a database DRIVER. Example products: Exadata, Greenplum, Netezza, Redshift, Teradata, DB2.]
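Since applications reach the EDW only through SQL over a database driver, the access pattern can be sketched in a few lines. This is a minimal illustration using SQLite as a stand-in for a real EDW engine; the `sales` table and its columns are made up for the example.

```python
import sqlite3

# SQLite stands in for a real EDW engine (Teradata, Netezza, ...);
# the table and column names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("north", 100.0), ("south", 250.0), ("north", 50.0)])

# A BI tool (BusinessObjects, Tableau, ...) would issue aggregate
# queries like this through the database driver.
rows = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(rows)  # [('north', 150.0), ('south', 250.0)]
```

The point is that every tool in the RDBMS world speaks the same language: structured tables in, SQL queries out.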
3. 3
Enterprise Data Warehouse (EDW)
• Used for reporting and data analysis.
• Data warehouse appliances have become the EDW trend.
• Before the EDW can be utilized, data must be loaded into tables using ETL.
• Data is accessed by applications using SQL.
High-level architecture of a conventional RDBMS
[Diagram: the same RDBMS stack annotated into three layers: QUERY (SQL query engine: optimizer / query plan), METADATA (system tables: metadata / statistics), and STORAGE (database tables: storage / tablespaces).]
4. 4
What are EDW benefits?
• Long history guarantees there is no need to re-invent the wheel.
• There are plenty of knowledgeable resources available.
• SQL is a standard, so migrating from one platform to another is possible, although it requires some amount of resources.
• With a highly tuned database and structured data, you can get results extremely fast.
• There is basically an endless number of tools available for various scenarios.
5. 5
What are EDW constraints?
• ETL is expensive.
• Limited to predefined data types.
• Vendor lock-in.
• SQL: if all you have is a hammer, everything looks like a nail.
• Cost efficiency: ~$10,000/TB.
• Scalability: linear vs. non-linear
• Capacity: TB vs. PB
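The cost figures quoted in this deck (~$10,000/TB for an EDW appliance, ~$1,000/TB for Hadoop) make for a simple back-of-the-envelope comparison. The warehouse size below is a hypothetical example, not a figure from the deck.

```python
# Cost-per-TB figures from the slides; warehouse size is hypothetical.
edw_cost_per_tb = 10_000
hadoop_cost_per_tb = 1_000

data_tb = 500  # a hypothetical 0.5 PB warehouse
edw_total = data_tb * edw_cost_per_tb        # $5,000,000
hadoop_total = data_tb * hadoop_cost_per_tb  # $500,000
print(edw_total, hadoop_total)  # 5000000 500000
```

At roughly an order of magnitude per TB, the storage-cost gap grows with the data, which is exactly where the TB-vs-PB capacity constraint bites.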
6. 6
Hadoop
• Distributed open-source framework for storage and processing.
• The Hadoop core consists of a storage part (HDFS), a cluster resource management part
(YARN), and the MapReduce computing framework.
• YARN provides resource management not only for MapReduce, but for various other
computing frameworks, including Spark, Impala, and SOLR, among others.
• Applications connect to these higher-level computing frameworks.
[Diagram: high-level architecture of Hadoop: MapReduce, Impala, Spark, and SOLR running on YARN, on top of HDFS.]
7. 7
What are Hadoop benefits?
• Hadoop provides HA and linear scalability by default.
• There is not necessarily a need for ETL:
• You can copy the data to HDFS and immediately start
analyzing, querying, and processing it.
• Storage capacity: PB vs. TB.
• Cost efficiency: ~$1,000/TB
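"Copy the data and start analyzing immediately" works because the schema is applied at read time (schema on read) instead of at load time. A minimal sketch, assuming a made-up web-server-style log format: the raw lines are stored untouched, and the "schema" lives entirely in the reader.

```python
# Schema on read: raw lines are stored as-is and interpreted only
# when processed. The log format and field names are illustrative.
raw_lines = [
    "2015-03-01 GET /index.html 200",
    "2015-03-01 GET /missing 404",
    "2015-03-02 POST /login 200",
]

def parse(line):
    # The schema lives in the reader, not in the storage layer.
    date, method, path, status = line.split()
    return {"date": date, "method": method,
            "path": path, "status": int(status)}

errors = [r["path"] for r in map(parse, raw_lines) if r["status"] >= 400]
print(errors)  # ['/missing']
```

Contrast this with an EDW, where the same lines would first have to pass through an ETL job into predefined table columns before any query could run.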
8. 8
What are Hadoop benefits?
• On Hadoop, the query, metadata, and storage layers are separate:
• Supports SQL and non-SQL query/processing engines.
• The catalog can have different descriptions for the same data files.
• Users can access the same data files with different query engines.
• Users are not limited to SQL.
[Diagram: RDBMS vs. Hadoop, layer by layer: QUERY — SQL query engine (optimizer / query plan) vs. Hive, Impala, Spark, SOLR; METADATA — system tables (metadata / statistics) vs. HCatalog / Metastore; STORAGE — database tables (storage / tablespaces) vs. HDFS.]
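The slide's claim that the catalog can hold different descriptions for the same data files can be mimicked in a few lines. This is only an analogy, not HCatalog's API: two "users" overlay different schemas on the same raw records, the way different engines can read the same HDFS files through different table definitions.

```python
# Two readers interpreting the same raw records with different schemas,
# mimicking how a catalog can overlay descriptions on shared data files.
# The CSV-like records and field choices are made up for illustration.
records = ["1,alice,42", "2,bob,17"]

def schema_a(line):
    # User A only cares about id and name.
    id_, name, _ = line.split(",")
    return (int(id_), name)

def schema_b(line):
    # User B reads id and score, and types score as an integer.
    id_, _, score = line.split(",")
    return (int(id_), int(score))

view_a = [schema_a(r) for r in records]
view_b = [schema_b(r) for r in records]
print(view_a)  # [(1, 'alice'), (2, 'bob')]
print(view_b)  # [(1, 42), (2, 17)]
```

Because the files themselves stay untouched, neither user's view constrains the other, which is impossible when schema and storage are glued together as in an RDBMS.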
9. 9
Coexistence: Hadoop as part of EDW
● Offload part of the ETL workloads to Hadoop.
● Use Hadoop as low-cost storage: Active Archive.
● Utilize existing BI and ETL tools with Hadoop.
[Diagram: Hadoop alongside the EDW: ETL, EDW, SDO; data is ingested via Sqoop, Flume, and Kafka into HDFS; MapReduce, Impala, Spark, and SOLR run on YARN; BI tools connect to both.]
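The ETL workloads worth offloading are typically heavy aggregations over raw records. A toy MapReduce-style sketch of such a transform, in plain Python with made-up input records (on a real cluster this shape of job would run as MapReduce or Spark over HDFS):

```python
from collections import defaultdict

# A toy MapReduce-style aggregation, sketching the kind of transform
# that could be offloaded from the EDW to Hadoop. Records are made up.
events = [("2015-03-01", 3), ("2015-03-01", 5), ("2015-03-02", 2)]

def map_phase(record):
    # Emit (key, value) pairs; on a cluster this runs per partition.
    day, count = record
    yield (day, count)

def reduce_phase(pairs):
    # Sum values per key; on a cluster this runs per key group.
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

mapped = [kv for rec in events for kv in map_phase(rec)]
daily_totals = reduce_phase(mapped)
print(daily_totals)  # {'2015-03-01': 8, '2015-03-02': 2}
```

Only the small aggregated result then needs to be loaded into the EDW, freeing the appliance for the core reporting workloads.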
10. 10
Q&A
● References:
● Hadoop and the Data Warehouse: Hadoop 101 for EDW Professionals:
http://www.cloudera.com/content/dam/www/marketing/resources/webinars/building-a-hadoop-data-warehouse-video.png.l
● Using HCatalog: https://cwiki.apache.org/confluence/display/Hive/HCatalog+UsingHCat
● Configuring the Hive Metastore:
http://www.cloudera.com/documentation/archive/cdh/4-x/4-2-1/CDH4-Installation-Guide/cdh4ig_topic_18_4.html
Editor's Notes
EDW is used for reporting and analysis.
The data is collected from various sources:
Operational Data Store (ODS)
External sources.
Applications connect to EDW through database driver.
The query engine consists of an optimizer and a mechanism to query the actual tables.
The effectiveness of query engine depends on how accurate the statistics are.
The underlying storage layer determines how fast the I/O is when running the queries.
On DW appliances, the user does not have to spend that much time figuring out how to create the physical layout of the storage.
When we take a closer look at the different EDW/RDBMS layers, we note that there are basically three layers:
QUERY
METADATA
STORAGE
On a traditional EDW and RDBMS these are glued together, and you cannot replace any of them.
This is basically how all the RDBMS work: Exadata, Teradata, MS SQL, Netezza, DB2...
RDBMSs came in the late 70s.
There is a great heritage related to the relational model and RDBMS.
Computer Science studies cover the relational database model and the typical RDBMS model very early on.
It is relatively easy to find people with decent SQL and RDBMS skills.
Different software vendors offer various tools for various problems. Free and proprietary.
The New York Stock Exchange generates 4-5 TB of data every day.
A typical appliance can optimally load 5 TB per hour of highly transformed data.
We would also need to perform maintenance tasks: generate stats, reorg, maintain the indices, and so on.
Quite fast in certain situations, but processing 24 h of data might take 24 hours, which just doesn't work anymore.
The RDBMS model is basically a black box. You are limited to SQL, and you cannot change the SQL engine to any other SQL engine or non-SQL engine.
1TB costs ~ $10,000
Maximum capacity is typically in TBs, maximally 1-2 PB.
Todays requirements:
NYSE generates 5TB data per day
Facebook has more than 240 billion photos, growing at 7 PB per month.
Ancestry.com stores 10 PB of data.
The Internet Archive stores 18.5 PB.
The size of the digital universe was 4.4 zettabytes in 2013, and it is estimated to grow to 44 zettabytes by 2020.
A zettabyte is 10^21 bytes.
Other examples of growing mountain of data:
Machine logs
RFID readers
GPS traces
Retail transactions
“More data usually beats better algorithms”
Hadoop works well on unstructured data because it is designed to interpret data at processing time (schema on read).
This provides flexibility and avoids the need for ETL/ELT.
There is no need for data normalization to remove redundancy.
A web server log is a good example of non-normalized data.
MapReduce and other processing models scale linearly with the size of the data: data is partitioned, and functional primitives can work in parallel on separate partitions.
If you double the size of the data, it takes twice as long to process.
If you double the size of the cluster when the size of the data doubles, processing takes the same time as before.
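The linear-scaling rule of thumb above can be written out as time ∝ data size / cluster size. An idealized sketch that ignores overheads like shuffle and startup costs; the throughput figure is an assumption for illustration:

```python
# Idealized linear-scaling model: time = data / (nodes * throughput).
# The per-node throughput of 1 TB/hour is an illustrative assumption.
def processing_time(data_tb, nodes, tb_per_node_hour=1.0):
    return data_tb / (nodes * tb_per_node_hour)

t0 = processing_time(100, 10)   # baseline: 10.0 hours
t1 = processing_time(200, 10)   # double the data: 20.0 hours
t2 = processing_time(200, 20)   # double data AND cluster: 10.0 hours
print(t0, t1, t2)  # 10.0 20.0 10.0
```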
With Hadoop the QUERY layer is interchangeable.
For SQL-like queries you can change from Hive to Impala and vice versa.
You can also use non-SQL processing frameworks.
The METADATA layer, where the catalog is placed, can have multiple access points to the data files.
Two users can run different queries against the same data files and utilize different schemas.
Different BI tools, which we already know from the EDW world, can utilize the SQL engines on Hadoop.
We still have more tools than just SQL, so we have more than just a hammer.
In real life, a small fraction of workloads often consumes the majority of computing resources, e.g. ETL workloads take 60%-70% of computing resources.
You can probably fund the Hadoop project with the savings from offloading part of the ETL workloads to Hadoop.
The EDW can then be freed to run the core workloads, which were the reason to purchase the EDW in the first place. No need to expand the EDW budget.
Utilize Hadoop as low-cost storage and use it as an Active Archive.
During the project you get resources trained and Hadoop cluster established.