Tapio Vaattanen <vaattanen@gmail.com>
EDW and Hadoop
"It's coexistence or no existence." - Bertrand Russell
2
Enterprise Data Warehouse (EDW)
• Used for reporting and data analysis.
• Data warehouse appliances have become an EDW trend.
• Before the EDW can be utilized, data must be loaded into tables using ETL.
• Data is accessed by applications using SQL (see the sketch below).
High-level architecture of a conventional RDBMS (diagram): applications (BusinessObjects, Tableau, Cognos, others) connect through a DRIVER to the SQL query engine (optimizer / query plan), which relies on system tables (metadata / statistics) and database tables (storage / tablespaces). Example platforms: Exadata, Greenplum, Netezza, Redshift, Teradata, DB2.
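To make the SQL-access bullet concrete, here is a minimal sketch (not from the original deck) of a reporting application reaching the warehouse through an ODBC driver; the DSN, credentials and the sales_fact table are hypothetical placeholders.

import pyodbc  # generic ODBC driver interface, as used by many BI/reporting tools

# Connect through the database driver configured for the warehouse (placeholder DSN).
conn = pyodbc.connect("DSN=EDW_DSN;UID=report_user;PWD=secret")
cursor = conn.cursor()

# A typical BI-style aggregate over a (hypothetical) star-schema fact table.
cursor.execute("""
    SELECT region, SUM(revenue) AS total_revenue
    FROM sales_fact
    GROUP BY region
""")
for region, total_revenue in cursor.fetchall():
    print(region, total_revenue)

conn.close()

Tools such as BusinessObjects, Tableau or Cognos issue essentially the same kind of SQL through the vendor's driver.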
3
Enterprise Data Warehouse (EDW)
• Used for reporting and data analysis.
• Data warehouse appliances have become an EDW trend.
• Before the EDW can be utilized, data must be loaded into tables using ETL.
• Data is accessed by applications using SQL.
High-level architecture of a conventional RDBMS (diagram): the QUERY layer (SQL query engine: optimizer / query plan), the METADATA layer (system tables: metadata / statistics) and the STORAGE layer (database tables: storage / tablespaces) are glued together into a single stack.
4
What are EDW benefits?
• The long history means there is no need to re-invent the wheel.
• There are plenty of knowledgeable resources available.
• SQL is a standard, so migrating from one platform to another is possible, although it requires some amount of resources.
• With a highly tuned database and structured data, you can get results extremely fast.
• There is a basically endless selection of tools available for various scenarios.
5
What are EDW constraints?
• ETL is expensive.
• Limited to predefined data types.
• Vendor lock-in.
• SQL: if all you have is a hammer, everything looks like a nail.
• Cost efficiency: ~$10,000/TB.
• Scalability: linear vs. non-linear.
• Capacity: TB vs. PB.
6
Hadoop
• A distributed open-source framework for storage and processing.
• The Hadoop core consists of a storage part (HDFS), a cluster resource management part (YARN) and the MapReduce computing framework.
• YARN provides resource management not only for MapReduce but also for various other computing frameworks, including Spark, Impala and SOLR.
• Applications connect to these higher-level computing frameworks (see the sketch below).
High-level architecture of Hadoop (diagram): MapReduce, Spark, Impala and SOLR run on YARN, which sits on top of HDFS.
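As a rough illustration of this layering (an added sketch, not from the original deck), the PySpark snippet below lets YARN manage resources and reads files straight from HDFS; the application name and path are invented, and in practice the master is usually supplied by spark-submit rather than hard-coded.

from pyspark.sql import SparkSession

# An application talks to a higher-level framework (Spark); Spark asks YARN for
# cluster resources and reads its input directly from HDFS.
spark = (
    SparkSession.builder
    .appName("edw-and-hadoop-demo")   # hypothetical application name
    .master("yarn")                   # let YARN do the resource management
    .getOrCreate()
)

logs = spark.read.text("hdfs:///data/raw/weblogs/")   # hypothetical HDFS path
print(logs.count())

spark.stop()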
7
What are Hadoop benefits?
• Hadoop provides HA and linear scalability by default.
• There is not necessarily any need for ETL:
• You can copy the data to HDFS and immediately start analyzing, querying and processing it (see the sketch below).
• Storage capacity: PB vs. TB.
• Cost efficiency: ~$1,000/TB.
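A minimal sketch of the "copy and query immediately" idea (added example, not from the deck): land a raw file in HDFS with the standard hdfs dfs -put command and interpret it at read time, with no load or transform step. The file name, target directory and the HTTP 500 filter are hypothetical.

import subprocess
from pyspark.sql import SparkSession

# Step 1: copy the raw file into HDFS as-is.
subprocess.run(["hdfs", "dfs", "-put", "access.log", "/data/raw/weblogs/"], check=True)

# Step 2: schema on read -- start querying right away, no table load required.
spark = SparkSession.builder.appName("schema-on-read-demo").getOrCreate()
logs = spark.read.text("hdfs:///data/raw/weblogs/access.log")
errors = logs.filter(logs.value.contains(" 500 "))   # e.g. HTTP 500 responses
print(errors.count())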
8
What are Hadoop benefits?
• On Hadoop, the query, metadata and storage layers are separate:
• Supports SQL and non-SQL query/processing engines.
• The catalog can have different descriptions for the same data files.
• Users are able to access the same data files with different query engines (see the sketch below).
• Users are not limited to SQL.
Layer comparison of RDBMS vs. Hadoop (diagram): QUERY layer: SQL query engine (optimizer / query plan) vs. Hive, Impala, Spark and SOLR; METADATA layer: system tables (metadata / statistics) vs. the Hive Metastore / HCatalog; STORAGE layer: database tables (storage / tablespaces) vs. HDFS.
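One way to picture the shared metadata layer (an added sketch under assumptions; the table name, columns and location are invented): files that already sit in HDFS are registered as an external table in the Hive Metastore, after which Spark SQL can query them, and because Hive and Impala read the same catalog they can query the very same table and files.

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("shared-metastore-demo")
    .enableHiveSupport()              # use the Hive Metastore as the catalog
    .getOrCreate()
)

# Register existing HDFS files in the shared catalog; the data stays where it is.
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS weblogs (
        ts STRING,
        url STRING,
        status INT
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'
    LOCATION 'hdfs:///data/raw/weblogs_tsv/'
""")

# Query through Spark SQL; Hive or Impala could run an equivalent query
# against the same table definition and the same files.
spark.sql("SELECT status, COUNT(*) AS hits FROM weblogs GROUP BY status").show()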
9
Coexistence: Hadoop as part of EDW
• Offload part of the ETL workloads to Hadoop (see the sketch below).
• Use Hadoop as low-cost storage: Active Archive.
• Utilize existing BI and ETL tools with Hadoop.
Coexistence diagram: the existing ETL, EDW, SDO and BI components sit alongside a Hadoop cluster (YARN, HDFS, MapReduce, Impala, Spark, SOLR), with data ingested via Sqoop, Flume and Kafka.
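One possible shape of the ETL-offload pattern, sketched under assumptions (the JDBC URL, source table and directory names are invented, and the exact Sqoop options depend on the source system): Sqoop pulls a source table into HDFS, Spark does the heavy transformation on the cluster, and only the curated result goes back toward the EDW/BI side.

import subprocess
from pyspark.sql import SparkSession

# Ingest: Sqoop copies the source table into HDFS using parallel map tasks.
subprocess.run([
    "sqoop", "import",
    "--connect", "jdbc:mysql://ods-host/sales",   # hypothetical source database
    "--table", "orders",
    "--target-dir", "/data/staging/orders",
], check=True)

# Transform on the Hadoop cluster instead of inside the EDW.
spark = SparkSession.builder.appName("etl-offload-demo").getOrCreate()
orders = spark.read.csv("/data/staging/orders", inferSchema=True)
daily = orders.groupBy(orders._c1).count()        # _c1: hypothetical order-date column
daily.write.mode("overwrite").parquet("/data/curated/orders_daily")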
10
Q&A
• References:
• Hadoop and the Data Warehouse: Hadoop 101 for EDW Professionals:
  http://www.cloudera.com/content/dam/www/marketing/resources/webinars/building-a-hadoop-data-warehouse-video.png.l
• Using HCatalog: https://cwiki.apache.org/confluence/display/Hive/HCatalog+UsingHCat
• Configuring the Hive Metastore:
  http://www.cloudera.com/documentation/archive/cdh/4-x/4-2-1/CDH4-Installation-Guide/cdh4ig_topic_18_4.html


Editor's Notes

  1. The EDW is used for reporting and analysis. The data is collected from various sources: the Operational Data Store (ODS) and external sources. Applications connect to the EDW through a database driver. The query engine consists of an optimizer and a mechanism to query the actual tables. The effectiveness of the query engine depends on how accurate the statistics are. The underlying storage layer determines how fast the I/O is when running the queries. On DW appliances the user does not have to spend that much time figuring out how to create the physical layout of the storage.
  2. When we take a closer look at the different EDW/RDBMS layers, we note that there are basically three layers: QUERY, METADATA and STORAGE. In a traditional EDW or RDBMS these are glued together and you cannot replace any of them. This is basically how all RDBMSs work: Exadata, Teradata, MS SQL, Netezza, DB2...
  3. The RDBMS arrived in the late 70's. There is a great heritage around the relational model and RDBMSs. Computer science studies go through the relational database model and the typical RDBMS model very early on. It is relatively easy to find people with decent SQL and RDBMS skills. Different software vendors offer various tools, free and proprietary, for various problems.
  4. The New York Stock Exchange generates 4-5 TB of data every day. A typical appliance can optimally load about 5 TB of highly transformed data per hour. We would also need to perform maintenance tasks: generating statistics, reorganizing, maintaining the indexes and so on. That is quite fast in certain situations, but processing 24 hours of data might take 24 hours, which just doesn't work anymore. The RDBMS model is basically a black box: you are limited to SQL and you cannot change the SQL engine to any other SQL engine or to a non-SQL engine. 1 TB costs ~$10,000. Maximum capacity is typically in the TB range, at most 1-2 PB.
  5. Today's requirements: the NYSE generates 5 TB of data per day. Facebook has more than 240 billion photos, growing at 7 PB per month. Ancestry.com stores 10 PB of data. The Internet Archive stores 18.5 PB. The size of the digital universe was 4.4 zettabytes in 2013 and it is estimated to grow to 44 zettabytes by 2020. A zettabyte is 10^21 bytes. Other examples of the growing mountain of data: machine logs, RFID readers, GPS traces, retail transactions. "More data usually beats better algorithms."
  6. Hadoop works well on unstructured data because it is designed to interpret data at processing time (schema on read). This provides flexibility and avoids the need for ETL/ELT. There is no need to normalize data to remove redundancy; a web server log is a good example of non-normalized data. MapReduce and other processing models scale linearly with the size of the data: data is partitioned, and functional primitives can work in parallel on separate partitions. If you double the size of the data, processing takes twice as long. If you double the size of the cluster when the size of the data doubles, processing takes the same time as before.
  7. With Hadoop the QUERY layer is interchangeable. For SQL-like queries you can change from Hive to Impala and vice versa, and you can also use non-SQL processing frameworks. The METADATA layer, where the catalog lives, can have multiple access points to the data files: two users can run different queries against the same data files and utilize different schemas. The BI tools we already know from the EDW world can utilize the SQL engines on Hadoop. We still have more tools than just SQL, so we have more than just a hammer.
  8. In real life a small fraction of workloads often consumes the majority of computing resources; for example, ETL workloads may take 60%-70% of computing resources. You can probably fund the Hadoop project with the savings from offloading part of the ETL workloads to Hadoop. The EDW is then freed to run the core workloads that were the reason to purchase it in the first place, with no need to expand the EDW budget. Utilize Hadoop as low-cost storage and use it as an Active Archive. During the project you get resources trained and a Hadoop cluster established.