Hadoop - A Set of Technologies
Data Warehouse - A Concept or Process
Comparing Hadoop with the Enterprise Data Warehouse
Any attempt to replace an organization's existing data warehouse with Hadoop
technology may lead to failure.
• The Hadoop set of technologies should be used to make the EDW more powerful.
• A meaningful and honest assessment needs to be done to decide where and how
Hadoop can be integrated to achieve an optimized architecture.
• Finally, look at a few high-level use cases that utilize Hadoop capabilities in the DWH.
Let's get into some more detail:
• Explore Data Warehouse business goals / benefits
• Glimpse the core advantages of Hadoop
• Understand the limitations of Hadoop
Enterprise Data Warehouse Business Goals / Benefits:
• Evaluate, monitor, manage and improve corporate performance.
• Manage and enhance customer relationships.
• Cleanse and improve the quality of the organization's data.
• Support decisions and forecast future growth and needs.
• Support, monitor and modify marketing campaigns.
Hadoop Core Advantages
Scalable
Hadoop is highly scalable: it can easily store and distribute very large
datasets across servers that operate in parallel.
Cost Effective
Hadoop is very cost-effective. It is based on a scale-out architecture that
can affordably store large volumes of data for future use.
Fast
Data is managed across a cluster on a distributed file system. Because
processing is mapped to the nodes where the data is stored, data processing
is faster.
Flexible
Hadoop enables enterprises to access and process data easily to generate the
values required, giving them the tools to get valuable insights from various
types of data sources operating in parallel.
Failure Resistant
One of the great advantages of Hadoop is its fault tolerance, which is provided
by replicating the data to other nodes in the cluster. If a node fails, the data
can be read from a replica.
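The replication idea behind "Failure Resistant" can be illustrated with a small simulation. This is only a sketch in plain Python, not Hadoop code: the node names, the block count and the helper functions `place_blocks` and `readable_blocks` are hypothetical, and the replication factor of 3 is assumed because it is the usual HDFS default.

```python
import random

def place_blocks(num_blocks, nodes, replication=3, seed=0):
    """Assign each block to `replication` distinct nodes (hypothetical placement policy)."""
    rng = random.Random(seed)
    return {block: rng.sample(nodes, replication) for block in range(num_blocks)}

def readable_blocks(placement, failed):
    """Return the blocks that still have at least one replica on a healthy node."""
    return [
        block
        for block, replicas in placement.items()
        if any(node not in failed for node in replicas)
    ]

# Hypothetical five-node cluster holding eight blocks.
nodes = [f"node{i}" for i in range(1, 6)]
placement = place_blocks(8, nodes, replication=3)

# With three replicas per block, losing one node leaves every block readable.
survivors = readable_blocks(placement, failed={"node2"})
print(len(survivors), "of", len(placement), "blocks still readable")
```

With three replicas on distinct nodes, a block becomes unreadable only if all three of its nodes fail at once, which is why a single node failure loses no data in this sketch.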
Hadoop Limitations
Vulnerable
Hadoop is written in Java, a very widely used language that has been heavily
targeted by cyber attackers and, as a result, implicated in numerous security
breaches.
Latency
HDFS is optimized for reading large batches of a data set quickly (high
throughput), rather than for accessing particular records in that data set
(low latency).
Inaptness with small data
Hadoop is not suited to small data. HDFS cannot efficiently support random
reads of many small files because it is designed for high-capacity storage
of large files.
Stability Issues
Hadoop, being an open source platform under continuous development, has a fair
possibility of stability issues.
Security Concern
Encryption at the storage and network levels is not enabled in Hadoop by
default and has to be configured explicitly, which is a major concern. Hadoop
supports Kerberos authentication, which is not easy to manage.
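The small-data limitation can be made concrete with some rough arithmetic. The sketch below is an illustration under stated assumptions, not a measurement: it assumes a 128 MiB HDFS block size (the usual default), assumes the NameNode tracks one in-memory object per file plus one per block, and uses the commonly quoted rule of thumb of about 150 bytes per object. The function `namenode_objects` is a hypothetical helper.

```python
import math

BLOCK_SIZE = 128 * 1024**2   # assumed HDFS block size: 128 MiB
BYTES_PER_OBJECT = 150       # assumed rule-of-thumb NameNode cost per file/block object

def namenode_objects(file_sizes):
    """Count metadata objects: one per file plus one per block of that file."""
    return sum(1 + max(1, math.ceil(size / BLOCK_SIZE)) for size in file_sizes)

# The same 10 GiB of data stored two ways.
small_files = [1024**2] * 10_240   # 10,240 files of 1 MiB each
large_files = [1024**3] * 10       # 10 files of 1 GiB each

for label, sizes in (("small files", small_files), ("large files", large_files)):
    objects = namenode_objects(sizes)
    print(label, objects, "objects,", objects * BYTES_PER_OBJECT, "bytes of metadata")
```

Under these assumptions the same volume of data needs 20,480 metadata objects as small files but only 90 as large files, which is one reason workloads with many small files fit HDFS poorly.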
Some scenarios where the power of Hadoop is needed to strengthen the Data Warehouse
• Storage and processing of semi-structured and unstructured data
• Reducing the cost of data storage for huge data volumes
• Increasing data retention to avoid premature data death
• Pre-processing of large volumes of data
Conventional Data Warehouse Architecture
[Diagram: Source Systems (CRM, ERP, Legacy, Third Party, External Data); ETL Layer (Extract, Transform & Load); Data Repository Layer (ODS, Enterprise Data Warehouse, Data Marts); Analytics Layer (Analytics)]
This is the traditional Data Warehouse architecture used by many organizations.
There are some variations on it, depending on technical and organizational
needs.
[Diagram: Unstructured Data Sources, Semi-structured Data Sources, Structured Data Sources; Enterprise Data Warehouse; Advanced Analytical Applications; Business Intelligence Layer]
In this use case, Hadoop is used to load the unstructured and semi-structured
data, make it available to the EDW according to the organization's requirements,
and also offer it for further analytical processing. Integrating new data sources
into the existing EDW gives the organization more and deeper analytics and insights.
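Making semi-structured data available to the EDW usually means turning raw records into typed rows. As a minimal sketch of that step, the snippet below parses a web server log line into a structured record. The log layout (Common Log Format), the sample line and the function `parse_line` are assumptions for illustration; the slides do not specify a format or a tool.

```python
import re

# Assumed layout: Common Log Format, e.g.
# 192.0.2.10 - - [10/Oct/2023:13:55:36 +0000] "GET /products/42 HTTP/1.1" 200 2326
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)

def parse_line(line):
    """Turn one raw log line into a dict of typed fields, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    row = match.groupdict()
    row["status"] = int(row["status"])
    row["size"] = 0 if row["size"] == "-" else int(row["size"])
    return row

sample = '192.0.2.10 - - [10/Oct/2023:13:55:36 +0000] "GET /products/42 HTTP/1.1" 200 2326'
print(parse_line(sample))
```

Lines that do not match return `None`, so malformed records can be counted or routed aside instead of silently reaching the warehouse.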
[Diagram: Sources (CRM, ERP, Legacy, Third Party, External Data); Extract, Transform & Load; ODS, Enterprise Data Warehouse, Data Marts; Analytics; Unstructured Sources (XMLs, Doc Files, Web Logs, Emails, Images, Videos); File Copy; Analytic Tools]
In this use case, Hadoop is used as the main data repository, and data from the
data warehouse is archived in Hadoop to take advantage of its low-cost storage.
The data warehouse acts here as a source for Hadoop. Another point to note is that
there is no change to the existing setup of the organization's EDW.
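An archiving flow like this needs a rule for deciding which warehouse data moves to Hadoop. The sketch below shows one possible rule: archive monthly partitions older than a retention window. The 24-month window, the monthly partitioning and the function `partitions_to_archive` are hypothetical; the slides do not prescribe a policy.

```python
from datetime import date

def partitions_to_archive(partitions, today, keep_months=24):
    """Return monthly partitions older than the assumed retention window."""
    # Compare months as a single running index so year boundaries need no special casing.
    cutoff = today.year * 12 + today.month - keep_months
    return [p for p in partitions if p.year * 12 + p.month < cutoff]

# Hypothetical monthly partitions, identified by the first day of the month.
partitions = [date(2021, 12, 1), date(2022, 5, 1), date(2022, 6, 1), date(2024, 1, 1)]
old = partitions_to_archive(partitions, today=date(2024, 6, 15))
print([p.isoformat() for p in old])
```

The partitions selected this way would be copied to Hadoop and only then dropped from the warehouse, so the EDW itself keeps its existing structure.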
[Diagram: Unstructured Sources (XMLs, Doc Files, Web Logs, Emails, Images, Videos); Structured Sources (CRM, ERP, Legacy); ODS, Enterprise Data Warehouse, Data Marts; Analytics Layer; Analytic Tools]
In this use case, Hadoop is shown as a layer in front of the existing EDW. Hadoop
sources all of the data, and its parallel processing capability is put to use: it
offloads the majority of transformations from the EDW and feeds it pre-processed
data. The EDW can then focus on aggregations and analytical reporting.
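The kind of transformation that gets offloaded can be pictured as a map step followed by a reduce step. The following is a plain-Python stand-in for that pattern, not the Hadoop MapReduce API; the record fields `customer_id` and `amount` and both function names are hypothetical.

```python
from collections import defaultdict

def map_phase(records):
    """Clean each raw record and emit (key, value) pairs; skip records with no amount."""
    for record in records:
        if record.get("amount") is None:
            continue
        yield record["customer_id"], float(record["amount"])

def reduce_phase(pairs):
    """Sum the values for each key, as a reducer would after the shuffle."""
    totals = defaultdict(float)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

# Hypothetical raw input with string amounts and one incomplete record.
raw = [
    {"customer_id": "c1", "amount": "10.5"},
    {"customer_id": "c2", "amount": "4"},
    {"customer_id": "c1", "amount": "2.5"},
    {"customer_id": "c3", "amount": None},
]
preprocessed = reduce_phase(map_phase(raw))
print(preprocessed)
```

On a real cluster the map step runs in parallel on many nodes, and only the compact, cleaned result is loaded into the EDW.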
[Diagram: Data Sources (XMLs, Doc Files, Web Logs, Emails, Images, Videos, CRM, ERP, Legacy); Extract & Load; Data Lake; Analytic Sandbox; Transformation; Enterprise Data Warehouse; Business Intelligence Layer]
In this scenario, a data lake is used and ELT is preferred over ETL. A data lake is a
storage repository that holds a vast amount of raw data in its native form, which can
be transformed later as needed. The EDW applies the transformations and uses the data.
This kind of architecture suits an organization's data science needs, since data
scientists can use the sandbox to apply their models to the raw data stored in the data lake.
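The difference between ELT and ETL is the order of the steps: raw data is loaded first and shaped only when it is needed. The sketch below shows that order on a tiny scale. It uses an in-memory SQLite database purely as a stand-in for both the data lake and the EDW, and the table names, JSON fields and sample events are hypothetical.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")

# Extract & Load: store the events exactly as they arrive, with no schema imposed.
conn.execute("CREATE TABLE raw_events (payload TEXT)")
raw = [
    '{"user_id": "a", "amount": 5}',
    '{"user_id": "b", "amount": 7, "coupon": "X1"}',
    '{"user_id": "a", "amount": 3}',
]
conn.executemany("INSERT INTO raw_events VALUES (?)", [(r,) for r in raw])

# Transform (later): parse only the fields this particular report needs.
conn.execute("CREATE TABLE spend (user_id TEXT, amount REAL)")
events = [json.loads(payload) for (payload,) in conn.execute("SELECT payload FROM raw_events")]
conn.executemany(
    "INSERT INTO spend VALUES (?, ?)",
    [(event["user_id"], event["amount"]) for event in events],
)

totals = dict(conn.execute("SELECT user_id, SUM(amount) FROM spend GROUP BY user_id"))
print(totals)
```

Because the raw payloads stay untouched, a later analysis could still use fields such as `coupon` that this first transformation ignored.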
To Conclude
Data warehouse architects now have more tools to work with, so a detailed analysis
of the organization and its business goals is needed before choosing the right set of
technologies to build a data warehouse.
The core benefits of a data warehouse are still needed and always will be. There is
always an opportunity to strengthen them through smart use of appropriate tools and
technologies.
Hadoop is likely to fail only when it is used merely to replace an existing data
warehouse, without a proper feasibility analysis and the intent to arrive at an
optimized architecture aligned with organizational goals.
Hadoop & Data Warehouse
