The Hadoop Data Refinery and
Enterprise Data Hub
Prepared for:
By Mike Ferguson
Intelligent Business Strategies
May 2014
INTELLIGENT BUSINESS STRATEGIES WHITEPAPER
Table of Contents

Management Summary
Introduction - Data Warehousing and the Origins of ETL Processing
    Scaling Up Data Integration – The Shift from ETL to ELT
The Emergence of Big Data and Multiple Analytical Workloads
    Characteristics of Multi-structured and Structured Big Data
    Big Data Analytical Workloads
    Hadoop – A Key Platform for Big Data Analytics
Building An Enterprise Data Hub Using MapR
    What Is An Enterprise Data Hub?
    The MapR Hadoop Distribution as an Enterprise Data Hub Platform
        MapR Disaster Recovery and Data Protection
        Hadoop Workloads and MapR Extensions
    The Data Refinery - Accelerating ETL Processing at Low Cost
    The Data Refinery - Exploratory Analysis
        Accelerating Big Data Consumption and Filtering Using Automated Analytics During in-Hadoop ELT Processing
    Key MapR Features That Meet Enterprise Data Hub Requirements
    Data Hub ELT Processing With MapR Hadoop Distributions
    Hadoop as a Data Hub for All Analytical Platforms
    Feeding Data Warehouses from a Hadoop Data Hub to Produce New Insight from Enriched Data
    Archiving Data Warehouse Data into Hadoop
Conclusion
MANAGEMENT SUMMARY
Over recent years many companies have seen huge growth in data volumes.
This has come from both existing structured and new semi-structured and
unstructured data sources. The relentless rise in online shopping, together with
the convenience of mobile devices, is contributing significantly to accelerating
transaction volumes. Transaction data is therefore growing rapidly, and
clickstream data from online browsing is reaching unprecedented volumes while
harbouring deep insight into online customer behaviour.
For most companies, the way they analyse sales and other transaction activity
is by extracting this data from e-commerce systems, cleaning, transforming
and integrating it with customer, product and financial data from other core
transaction processing systems, loading it into a data warehouse and then
analysing subsets of it in data marts using business intelligence tools. Over the
years as transaction data has grown and other data sources have become
available for analysis, the challenge of data extraction, transformation and
loading (ETL) has become increasingly difficult to scale. To cope, ETL tools
switched to ELT, extracting and loading data into data warehouse staging
tables first and then transforming it using the power of parallel SQL processing
in a massively parallel database. However today, the momentum behind the
use of online channels as the preferred way of transacting business and
interacting with companies has become so great that data volumes are
increasing at rates we have not seen before. Clickstream data, inbound email
interactions, social media interactions, and sensor data are taking data
volumes to new heights. The result is that staging areas in data warehouses
set aside for ETL processing are becoming so large that ETL processing on its
own is driving expensive upgrades to data warehouses to handle the workload.
In addition, analysis of complex data is also happening on Hadoop to analyse
new data types such as text, JSON, clickstream, images and video.
This paper looks at an alternative solution: the creation of an Enterprise Data
Hub, with a data landing zone and data refinery, on a much lower cost Hadoop
platform that can scale to manage increasing data volumes as well as integrate
structured master and transaction data with more complex, high value data like
clickstream and multi-structured interaction data. In addition we look at how
Hadoop can be used as an analytical platform to support exploratory analysis
of raw data within a data refinery in the Enterprise Data Hub to produce new
insights that can be published and offered up to business analysts for use in
further analyses. From here, business analysts throughout the enterprise can
subscribe to receive new insights into traditional data warehouses and data
marts so as to enrich what companies already know, with the aim of delivering
competitive advantage in existing and new markets. We will also look at how
Hadoop can act as a long-term data store for big data as well as an on-line
archive for data warehouse data that is no longer analysed on a frequent basis.
MapR is a Hadoop vendor that has enhanced its MapR M5 Edition and MapR
M7 Edition to support high availability features such as JobTracker HA™ and
No NameNode HA™, MapR Direct Access NFS™, snapshots for online point-
in-time data recovery, automatic data compression, remote mirroring, disaster
recovery, and data protection. Its disaster recovery and data protection
features make M5 and M7 capable of becoming a long-term, low cost data store
where new big data sources can be analysed and archived data from
traditional data warehouses can be stored and selectively reprocessed. In
addition, M5 and M7 offer workload management support allowing a Hadoop
cluster to be logically divided to support different use cases, job types, user
groups, and administrators. Jobs can also be isolated. All this helps support
multiple workloads and allows usage to be managed and tracked.
These capabilities make MapR an enterprise-grade Hadoop platform capable
of supporting an enterprise data hub encompassing a data landing zone and
data refinery where data can be cleaned, integrated and analysed by data
scientists to produce new insights for competitive advantage. These new
insights can then be supplied to data warehouses, data marts and other
analytic platforms forming the data foundation of a multi-platform analytical
ecosystem.
INTRODUCTION - DATA WAREHOUSING AND THE
ORIGINS OF ETL PROCESSING
For many years, companies have been building data warehouses to analyse
business activity and produce insights for decision makers to act on to improve
business performance. These traditional analytical systems are often based on
a classic pattern where data from multiple transaction processing systems is
captured, cleaned, transformed and integrated before loading it into a data
warehouse.
Initially, the challenge of capturing, cleaning and integrating data was the role
of IT programmers who wrote hand-crafted code to extract, transform and load
(ETL) data from multiple sources into newly designed data warehouse
databases for subsequent analysis and reporting. Soon however, new software
ETL tools emerged to take on this task and improve productivity. Some of
these tools generated 3GL and 4GL code to do the work while others
interpreted graphically defined rules at run time. ETL execution involved
extracting data from multiple operational systems, moving the data to the ETL
server, and transforming and integrating it on the server before loading it into a
target data warehouse. In the early years as customer demand grew, vendors
added support for more and more structured data sources including popular
packaged transaction processing applications, new file formats and popular
external data providers.
However, more data sources led to larger data volumes causing many
customers to start hitting performance limitations especially when data was
being totally refreshed. ETL tool vendors responded by adding support for
change data capture, but even so, the problem of ETL performance emerged
again as business demand for data increased.
SCALING UP DATA INTEGRATION – THE SHIFT FROM ETL TO ELT
To counter this problem, many ETL vendors began to look at new ways of
achieving scalability. One of the most popular ways adopted to do this was to
exploit parallel query processing in massively parallel (MPP) relational DBMSs.
Rather than just loading transformed data from an ETL server into target MPP
RDBMSs, several ETL vendors realised that they could boost performance by
capturing data from multiple data sources, loading it into staging tables on a
target MPP RDBMS and then generating SQL to transform the data using
massively parallel query processing in the DBMS. The result was significant
performance improvement that also made it possible for transformed,
integrated data to then be moved from staging tables into production as a
“within the box” process on the same RDBMS platform. This approach gave
rise to the term Extract, Load, Transform (ELT) whereby MPP RDBMSs took
data integration scalability to a new level.
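To make the ELT pattern concrete, the following is a minimal, generic-SQL sketch of the approach described above; all table and column names are illustrative assumptions rather than taken from any particular product:

    -- Raw extracts are bulk-loaded into a staging table by the ETL tool
    CREATE TABLE stg_sales (
        order_id     BIGINT,
        customer_ref VARCHAR(20),
        order_ts     VARCHAR(30),   -- raw timestamp string as extracted
        amount       VARCHAR(20)    -- raw amount string as extracted
    );

    -- The "T" of ELT then runs as SQL generated by the ETL tool and executed
    -- in parallel by the MPP database: cleanse, conform and integrate the
    -- staged rows into the production warehouse table.
    INSERT INTO dw_sales (order_id, customer_key, order_date, amount)
    SELECT s.order_id,
           c.customer_key,                    -- surrogate key lookup
           CAST(s.order_ts AS DATE),
           CAST(s.amount AS DECIMAL(12,2))
    FROM   stg_sales s
    JOIN   dim_customer c ON c.source_ref = s.customer_ref
    WHERE  s.amount IS NOT NULL;              -- reject incomplete rows

Because both the staging and target tables live in the same MPP database, the transformation runs "within the box" across all nodes, avoiding a round trip through an ETL server.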
THE EMERGENCE OF BIG DATA AND MULTIPLE
ANALYTICAL WORKLOADS
Although this traditional environment is now mature, many new, more complex
types of data have emerged that businesses want to analyse to enrich
what they already know. In addition, the rate at which much of this new data is
being created and/or generated is far beyond what we have ever seen before.
Customers and prospects are creating huge amounts of new data on social
networks and review web sites. In addition, online news items, weather data,
competitor web site content, and even data marketplaces are now available as
candidate data sources for business consumption.
Within the enterprise, web logs are growing at staggering rates as customers
switch to online channels as their preferred way to transact business and
interact with companies. Also, increasing numbers of sensor networks and
machines are being deployed to instrument and optimise business operations.
The result is an abundance of new “big data” sources, rapidly increasing data
volumes and a flurry of new data streams that all need to be analysed.
CHARACTERISTICS OF MULTI-STRUCTURED AND STRUCTURED BIG DATA
The characteristics of these new data sources are different from the structured
data that has been analysed in data warehouses for the last twenty years. For
example, the variety of data types being captured now includes:
• Structured data
• Semi-structured data, e.g. XML, HTML
• Unstructured data, e.g. text, audio, video
• Machine-generated data, e.g. sensor data
Semi-structured data such as XML allows navigation of XML paths to go
deeper into the content and derive business value. Unstructured text requires
text mining to parse the data and derive structured data from it, while also
building full text indexes. Deriving insight from unstructured sound and
video data is more challenging but even here, demand is growing especially
from government agencies and law enforcement.
In addition to data variety, the volumes of data are also increasing.
Unstructured and machine-generated data, in particular, can be very large in
volume. However, volumes of structured transaction data are also increasing
rapidly mainly because of the growth in the use of online channels from
desktop computers and mobile devices. One side effect of much larger
transaction volumes is that staging tables on data warehouses that hold data
awaiting ELT processing are growing rapidly, which in turn is forcing
companies to upgrade data warehouse platforms, often at considerable cost, to
hold more data.
Finally, the rate (velocity) at which data is being generated is also increasing.
Clickstream data, sensor data and financial markets data are good examples of
this and are sometimes referred to as data streams.
BIG DATA ANALYTICAL WORKLOADS
The arrival of big data and big data analytics has taken us beyond the
traditional analytical workloads seen in data warehouses. Examples of new
analytical workloads include:
• Analysis of data in motion
• Complex analysis of structured data
• Exploratory analysis of un-modeled multi-structured data
• Graph analysis e.g. social networks
• Accelerating ETL processing of structured and multi-structured data to
enrich data in a data warehouse or analytical appliance
• The long term storage and reprocessing of archived data warehouse
data for rapid selective retrieval
These new analytical workloads are more likely to be processed outside of
traditional data warehouses and data marts on platforms more suited to these
kinds of workloads.
HADOOP – A KEY PLATFORM FOR BIG DATA ANALYTICS
One key platform that has emerged to support big data analytical workloads is
Apache Hadoop. The Hadoop software “stack” has a number of components
including:
• Hadoop HDFS – a distributed file system that partitions large files across
multiple machines for high-throughput access to data
• Hadoop YARN – a framework for job scheduling and cluster resource
management
• Hadoop MapReduce – a programming framework for distributed batch
processing of large data sets distributed across multiple servers
• Hive – a data warehouse system for Hadoop that facilitates data
summarization, ad-hoc queries, and the analysis of large datasets stored in
Hadoop-compatible file systems. Hive provides a mechanism to project
structure onto this data and query it using a SQL-like language called
HiveQL. HiveQL programs are converted into MapReduce programs
• HBase – an open-source, distributed, versioned, column-oriented store
modeled after Google's BigTable
• Pig – a high-level data-flow language for expressing MapReduce programs
for analyzing large HDFS distributed data sets
• Mahout – a scalable machine learning and data mining library
• Oozie – a workflow/coordination system to manage Hadoop jobs
• Spark – a general purpose engine for large scale in-memory data
processing. It supports analytical applications that wish to make use of
stream processing, SQL access to columnar data and analytics on
distributed in-memory data
• ZooKeeper – a coordination service for distributed applications
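As a hedged illustration of how Hive projects structure onto data already in Hadoop, the HiveQL below defines an external table over raw delimited files and runs an ad-hoc query against it; the table name, columns and path are assumptions made for the sketch:

    -- Project a schema onto raw tab-delimited files already sitting in HDFS
    CREATE EXTERNAL TABLE page_views (
        view_ts STRING,
        user_id STRING,
        url     STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    LOCATION '/data/raw/page_views';

    -- Hive compiles this ad-hoc HiveQL query into MapReduce jobs
    SELECT url, COUNT(*) AS views
    FROM   page_views
    GROUP  BY url
    ORDER  BY views DESC
    LIMIT  10;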
BUILDING AN ENTERPRISE DATA HUB USING MAPR
WHAT IS AN ENTERPRISE DATA HUB?
Having discussed the characteristics of new sources of data, new
analytical workloads that now need to be supported and Hadoop as a
key platform for analytics, a key question is “How does Hadoop fit into
an existing analytical environment?” A key emerging role for Hadoop is
that of an Enterprise Data Hub as shown in Figure 1.
[Figure 1: The managed Hadoop enterprise data hub includes a data reservoir, a data refinery and a zone for new insights, published to platforms such as an EDW, DW appliance and graph DBMS.]
An enterprise data hub is a managed and governed Hadoop environment in
which to land raw data, refine it and publish new insights that can be delivered
to authorised users throughout the enterprise, either on-demand or on a
subscription basis. These users may want to add the new insights to existing
data warehouses and data marts to enrich what they already know and/or
conduct further analyses for competitive advantage.
The Enterprise Data Hub consists of:
• A managed data reservoir
• A governed data refinery
• Published, protected and secure high value insights
• Long-term storage of archived data from data warehouses
All of this is made available in a secure, well-governed environment. Within the
enterprise data hub, a data reservoir is where raw data is landed, collected and
organised before it enters into the data refinery where data and data
relationships are discovered, data is parsed, profiled, cleansed, transformed
and integrated. It is then made available to data scientists who may combine it
with other trusted data such as master data or historical data from a data
warehouse before conducting exploratory analyses in a sandbox environment
to identify and produce new business insights. These insights are the output of
the data refining process. They are made available to other authorised users in
the enterprise by first describing them using common vocabulary data
definitions and then publishing them into a new insights zone where they
become available for distribution to other platforms and analytical projects. In
addition, cold data that is no longer in frequent use can be archived into the
hub for long-term storage and selective reprocessing.
THE MAPR DISTRIBUTION FOR HADOOP AS AN ENTERPRISE DATA HUB
PLATFORM
MapR is a vendor that provides a Hadoop platform upon which to build a
managed Enterprise Data Hub. MapR was founded in 2009, is based in San
Jose, California, and offers three editions of its Hadoop Distribution, each
packaging over 20 Apache Hadoop projects. These are:
• MapR M3 Standard Edition – a free community edition that includes
HBase™, Pig, Hive, Mahout, Cascading, Sqoop, Flume etc. It includes
POSIX-compliant NFS file system access.
• MapR M5 Enterprise Edition – includes HBase™, Pig, Hive, Mahout,
Cascading, Sqoop, Flume, Impala, Spark etc. M5 is a no-single-point-of-
failure edition with high availability and data protection features such as
JobTracker HA, no-NameNode HA, Snapshots and Mirroring to synchronise
data across clusters.
• MapR M7 Enterprise Database Edition – includes all the capabilities of M5
plus enterprise-grade modifications to HBase to make it more dependable
and faster.
MapR Disaster Recovery and Data Protection
MapR has also strengthened Hadoop by adding support for disaster recovery
and data protection in its M5 and M7 Hadoop distribution offerings.
In the area of disaster recovery, MapR provides remote mirroring to keep a
synchronized copy of data at a remote site, so that processing can continue
uninterrupted in the case of a disaster. Management of multiple on-site or
geographically dispersed clusters is available with the MapR Control System.
With respect to data protection, MapR has no single points of failure, with no
NameNode HA and distributed cluster metadata. MapR Snapshot provides
point-in-time recovery while MapR Mirroring offers business continuity.
Everything in the MapR distribution is logged and able to restart, with the intent
that the entire cluster is self-healing and self-tuning. The JobTracker and
NameNode have been re-engineered to be distributed and replicated. Direct
Access NFS HA means that clients do not idle waiting for unavailable servers
and rolling upgrades make sure that the cluster is always available. In addition,
workload management is also supported including job isolation, job placement
control, logical volumes, SLA enforcement and enterprise access control to
isolate and secure data access.
Hadoop Workloads and MapR Extensions
Specific examples of where Hadoop is particularly well suited include:
• Offloading and accelerating data warehouse ELT processing at low cost
• Exploratory analysis of un-modeled multi-structured data
• Extreme analytics – for example having to run millions of scoring
models concurrently on millions of accounts to detect “cramming” fraud
on credit cards. This is an activity whereby fraudsters attempt to steal
small amounts of money from large numbers of credit card accounts by
associating false charges with vague financial services and hoping
consumers just don’t notice. Running millions of analytical models
concurrently on data is typically not a workload you would see running
in a data warehouse.
• The long term storage of data and reprocessing of archived data
warehouse data for rapid selective retrieval
These are all workloads that you would expect to find in an Enterprise Data
Hub. The MapR enhancements to the underlying data platform that powers
its Hadoop distribution provide the capabilities needed to support these,
including continuous data capture, offloading of ELT processing, exploratory
analytics, long term storage of archived warehouse data, and selective retrieval
of it for analytical processing. Let’s look at these in more detail with particular
focus on the data refining process and offloading ELT processing and some
analytical workloads from data warehouses.
THE DATA REFINERY - ACCELERATING ETL PROCESSING AT LOW COST
The evolution of ETL on big data platforms like Hadoop has mirrored that on
traditional data warehouses. First of all, hand-crafted ETL programs were
written to provision data into Hadoop, transform and integrate it for exploratory
analysis. The problem with this approach is that even if these programs exploit
the multi-processor, multi-server Hadoop platform, development is slow and
expensive, requiring scarce MapReduce programming skills.
ETL tool vendors responded by announcing support for Hadoop as both a
target to provision data for exploratory analysis and a source to move derived
insights from Hadoop into data warehouses. However, while this approach
works, ETL processing occurs outside the Hadoop environment and so is
unable to exploit the scalability of the Hadoop platform to deal with the
characteristics of big data.
In order to get scalability, ETL vendors have evolved their products to exploit
Hadoop by implementing ELT processing much as they did on data warehouse
systems. The difference now, however, is that all the data is loaded into
a Hadoop cluster for ELT processing via generated 3GL, Hive or Pig ELT jobs
running natively on a low cost Hadoop cluster. This is shown in Figures 2 and
3. It is this capability that is so attractive to many companies who are looking
for a way to offload ELT processing from data warehouses and create an
enterprise data hub. Offloading ELT processing to Hadoop frees up
considerable capacity on data warehouse platforms thereby potentially
avoiding expensive data warehouse upgrades. This is especially significant as
transaction data volumes continue to grow and new big data sources become
available for analysis.
[Figure 2: Provisioning data into Hadoop for exploratory analysis of multi-structured data, such as web logs, filtered sensor data and structured data, using in-Hadoop ELT processing.]
Several ETL tool vendors have now re-written transforms to run on Hadoop.
Also, several have added new tools and transformations to handle large
volumes of multi-structured data. Therefore what we are seeing in Hadoop
environments is a full repetition of what happened with ETL tools on MPP
RDBMSs. This time however, the attraction is that ELT processing can
potentially be done at a much lower cost given that a Hadoop cluster is a much
cheaper platform to store any kind of data. It also opens up the way for ELT
processing to be offloaded from data warehouses and to exploit the full power
of a Hadoop cluster to get the scalability needed to improve performance in a
big data environment.
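The HiveQL below is a hedged sketch of the kind of in-Hadoop ELT job an ETL tool might emit when it generates HiveQL; the staged tables, columns and transformations are illustrative assumptions, and each statement is compiled by Hive into parallel MapReduce jobs on the cluster:

    -- Cleanse and integrate raw staged records into a refined table
    -- (CTAS: create-table-as-select, compiled into MapReduce by Hive)
    CREATE TABLE refined_orders AS
    SELECT o.order_id,
           upper(trim(o.country_code)) AS country_code,   -- standardise codes
           cast(o.amount AS DOUBLE)    AS amount,          -- type conversion
           c.customer_key                                  -- integrate customer data
    FROM   staged_orders o
    JOIN   staged_customers c ON c.customer_ref = o.customer_ref
    WHERE  o.order_id IS NOT NULL;                         -- reject incomplete rows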
[Figure 3: Scaling ETL transformations by generating Pig, Hive or 3GL MapReduce code for in-Hadoop ELT processing. Option 1: the ETL tool generates HQL or converts generated SQL to HQL. Option 2: the ETL tool generates Pig Latin, whose compiler converts every transform to a MapReduce job. Option 3: the ETL tool generates 3GL MapReduce code.]
The aforementioned attractions of the Figure 3 pattern are leading many
companies to consider placing a significant slice of their structured and multi-
structured data into a Hadoop Enterprise Data Hub for ELT processing before
making subsets of it available for exploratory analysis on Hadoop itself or to
other data warehouse and NoSQL platforms. The main challenge is the
migration of existing ELT jobs running on existing data warehouses, which is
helped by Hive being able to convert ELT-generated SQL (used to transform
data) into MapReduce or potentially even in-memory Spark programs.
THE DATA REFINERY - EXPLORATORY ANALYSIS
In addition to ETL processing, another analytical workload very much part of
the data refining process in a Hadoop Enterprise Data Hub is the exploratory
analysis of complex data types. This is where data scientists in the enterprise
data hub often use freeform exploratory tools like search and/or develop and
run batch MapReduce or Spark analytic applications (written in languages like
Java, Python, Scala and R) to conduct exploratory analyses on un-modelled
data stored in the Hadoop system. The purpose of this analysis in the data
refining process is to derive structured insight from unstructured data that may
then be stored in HBase, Hive or moved into a data warehouse for further
analysis. With Hadoop MapReduce, these analytical programs are copied to
thousands of compute nodes in a Hadoop cluster where the data is located in
order to run the batch analysis in parallel. In addition, in-Hadoop analytics in
the Mahout library can run in parallel close to the data to exploit the full power
of a Hadoop cluster. The addition of Spark means that MapR can improve
performance of exploratory analytical applications by exploiting in-memory
processing. Data access can also be simplified by using SQL via Shark on
Spark instead of lower level HDFS APIs. Insight derived from this exploratory
analysis can then be published and moved into data warehouses to enrich
value or to other analytical data stores for further analysis.
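As a hedged illustration of the simplified SQL access mentioned above, the query below shows the style of exploratory HiveQL a data scientist might run through Shark on Spark; the table names are assumptions, and the shark.cache table property shown is one of the conventions Shark used for marking in-memory tables:

    -- Cache a refined clickstream table in cluster memory (Shark convention)
    CREATE TABLE clicks_cached TBLPROPERTIES ('shark.cache' = 'true')
    AS SELECT * FROM refined_clicks;

    -- Repeated exploratory queries then run against the in-memory data
    SELECT user_id,
           COUNT(*)     AS page_views,
           MIN(view_ts) AS first_seen
    FROM   clicks_cached
    GROUP  BY user_id
    ORDER  BY page_views DESC
    LIMIT  20;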
Accelerating Big Data Consumption and Filtering Using
Automated Analytics During in-Hadoop ELT Processing
One of the challenges with big data is dealing with the data deluge: data is
arriving faster than companies can consume it. They therefore have to find a
way to automate the consumption and refining of popular data to deal with the
deluge and bring data into the enterprise in a timely way for the business to
analyse and act on.
That means not only doing ETL processing on Hadoop, but also being able to
analyse data during ELT processing as part of the data refining process. An
example might be to score Twitter sentiment during ELT processing of Twitter
data on Hadoop so that negative sentiment can be identified quickly and
attributed to customers, products, brand or business functions like customer
service.
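A hedged HiveQL sketch of this idea follows: a user-defined sentiment-scoring function (here a hypothetical score_sentiment UDF, with an illustrative jar path and class name, not a built-in Hive function) is invoked in-line while refining raw tweets:

    -- Register a hypothetical custom sentiment-scoring UDF
    ADD JAR /jars/sentiment-udf.jar;
    CREATE TEMPORARY FUNCTION score_sentiment AS 'com.example.SentimentUDF';

    -- Score sentiment in-line as part of the ELT refinement step
    INSERT INTO TABLE scored_tweets
    SELECT t.tweet_id,
           t.user_id,
           t.body,
           score_sentiment(t.body) AS sentiment   -- automated analytic during ELT
    FROM   raw_tweets t;

Negative scores can then be attributed to customers, products or brands as soon as the data lands, rather than in a separate downstream analysis.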
Figure 4 takes in-Hadoop ELT processing further than in Figure 1 by doing in-
line automated analytics on Hadoop during ELT processing. In this way,
popular structured and multi-structured data sources may be consumed and
refined in a more automated way, thereby expediting time to value. In addition,
if data scientists have built custom map reduce or Spark based analytics on
this kind of data, then it is potentially possible to exploit these analytics during
ELT processing. This means that once data scientists have built analytics that
analyse data, they can be used and re-used in analyses to produce insight
from new big data sources. Note how in Figure 4 automated analysis allows
the ELT workflow to span the entire data refining process.
[Figure 4: The managed Hadoop enterprise data hub includes automated invocation of custom-built and pre-built analytics on Hadoop to refine data much more rapidly.]
In this way it becomes possible to build an “Enterprise Data Filter” that can
speed consumption of new and existing data sources to expedite the
production of new high value business insights.
KEY MAPR FEATURES THAT MEET ENTERPRISE DATA HUB REQUIREMENTS
Given what is potentially possible, the next question is "How has MapR
enhanced its Hadoop distributions to support the operation of an Enterprise
Data Hub?" Since its inception, MapR has sought to enhance a critical part of
the open source Apache Hadoop stack to improve availability, open up access
and improve overall performance and usability.
MapR has strengthened Apache Hadoop considerably to improve its resilience,
improve performance and make it easier to manage. For example, they have
removed multiple single points of failure in Apache Hadoop and introduced
data mirroring across clusters, using asynchronous replication, to support
failover and disaster recovery. In addition, they have added data snapshots, a
heat map management console and have improved performance through data
compression and by rewriting the intermediate shuffle phase that occurs after
Map and before Reduce. HBase has also been strengthened (MapR M7
Edition) to remove compactions and Spark has been added to facilitate high
performance in-memory analytics. All of this makes the MapR distributions
much more enterprise-grade.
Key features of MapR M5 and M7 that benefit ETL processing include:
• No single point of failure edition with high availability features such as
JobTracker High Availability and no-NameNode High Availability. In a
global business where ETL processing may need to happen several
times within the day, high availability is very important
• MapR Direct Access™ NFS which enables real-time read/write data
flows via the industry-standard Network File System (NFS) protocol.
With MapR Direct Access NFS, any remote client can simply mount the
cluster. This means that application servers can write their log files and
other data directly into the cluster, rather than writing it first to direct- or
network-attached storage. This reduces the need for log collection
tools that may require agents on every application server. Application
servers can either write data directly into the cluster or use standard
tools like Rsync to synchronize data between local disks and the
cluster. Either way, it means that ELT processing on log data needed
for clickstream analytics could potentially avoid the extract and load into
MapR M5/M7, thus speeding up the process of making this data
available for analysis. It also reduces data latency, which can be
important in many applications (a sketch of exposing such NFS-landed
log data to Hive follows this list).
• MapR Snapshots allow for online point-in-time data recovery without
replication of data. A volume snapshot is consistent (atomic), does not
copy data, and does not impact performance. Snapshots can potentially
help ETL processing in the event of a failure where an ETL job may
need to be restarted from the point of failure or at least from an
intermediate snapshot taken at specific points in the ETL processing.
• The re-writing of the intermediate shuffle phase that occurs after Map
and before Reduce can really help improve ETL performance for ETL
tools generating map-reduce ETL jobs via Hive, Pig or natively with a
3GL language such as Java.
• Automatic data compression can also help improve performance and so
speed up data refinery processes.
• The MapR data protection and disaster recovery capabilities make the
MapR Distributions for Hadoop suitable for long-term storage of big
data and data warehouse archived data which can then be selectively
reprocessed in specific analyses even though it is offloaded from the
data warehouse.
• The MapR remote mirroring capability also allows ELT and analytical
workloads in a data refinery to be spread across clusters in order to get
more work done.
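Following on from the Direct Access NFS point above, the HiveQL below is a minimal sketch, assuming application servers have written log files straight into an illustrative cluster directory over an NFS mount; once the external table is defined, the logs are queryable without a separate extract-and-load step:

    -- Expose NFS-landed log files to Hive without moving or copying them
    CREATE EXTERNAL TABLE web_logs (
        log_ts  STRING,
        client  STRING,
        request STRING,
        status  INT
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
    LOCATION '/apps/weblogs';    -- directory populated directly via NFS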
DATA HUB ELT PROCESSING WITH MAPR HADOOP DISTRIBUTIONS
With respect to ELT processing on Hadoop, MapR partners with a number of
ETL tool vendors and ETL accelerator vendors that can run ELT jobs on MapR
M5 and M7 Hadoop clusters. These include:
• Informatica
• Pentaho
• Talend
• Syncsort
Using these partner technologies in combination with the MapR M5 and M7
editions, the following patterns are supported:
• Accelerating big data consumption and filtering by using in-Hadoop
analytics during ELT processing
• In-Hadoop ELT processing via MapReduce-based transformations
• Provisioning data into Hadoop sandboxes for exploratory analysis as
part of a data refining process
• Feeding data warehouses from Hadoop to accelerate multi-platform
analytics
In terms of scalability, adding more Hadoop nodes to the cluster allows you to
process this data at speed due to greater I/O parallelism and compute power.
For very large amounts of data, a MapR Hadoop cluster can be spun up in a
cloud environment like the Google Compute Engine to undertake this work at a
much lower cost than trying to configure a cluster of similar size in-house.
With respect to restart of ELT processing, MapR snapshots taken at specific
points in ELT processing make it possible to restart any data refinery ELT
process quickly.
MapR also provides random read/write access in its Hadoop distribution. In
Apache Hadoop, HDFS is normally append-only, but one of the key features of
the MapR Distribution is Direct Access NFS. In the context of ELT, Direct
Access NFS allows faster and more convenient loading of data into the
Hadoop cluster, thereby reducing data latency. ETL tools that support Change
Data Capture can write changes straight into the MapR Hadoop cluster. An
example of a MapR partner ETL tool vendor that can do this is Talend.
Change data capture is very important to ETL performance, especially on large
volumes of data.
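To show how captured changes might then be applied during ELT, the following HiveQL is a hedged sketch of a common change-data-capture apply pattern (table and column names are assumptions); since Hive of this era has no UPDATE or MERGE, the current view of a table is rebuilt by combining the prior base data with the latest deltas and keeping the newest row per key:

    -- Rebuild the current customer view from the base snapshot plus deltas
    INSERT OVERWRITE TABLE customers_current
    SELECT customer_ref, name, email, updated_ts
    FROM (
        SELECT customer_ref, name, email, updated_ts,
               ROW_NUMBER() OVER (PARTITION BY customer_ref
                                  ORDER BY updated_ts DESC) AS rn
        FROM (
            SELECT customer_ref, name, email, updated_ts FROM customers_base
            UNION ALL
            SELECT customer_ref, name, email, updated_ts FROM cdc_deltas
        ) merged
    ) ranked
    WHERE rn = 1;    -- keep only the most recent version of each customer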
Finally in terms of ELT performance, the re-writing of the intermediate shuffle
phase that occurs after Map and before Reduce will benefit sorting,
aggregation, hashing and pattern matching transformations all of which are
mainstream transformation functionality needed in most ELT jobs. This
functionality, together with data compression, will boost performance.
These features allow MapR to support the key ETL patterns described in the sections that follow.
HADOOP AS A DATA HUB FOR ALL ANALYTICAL PLATFORMS
Given these enhancements, the MapR Distribution could potentially be used
not only to offload processing from data warehouses but also to create a low
cost data hub (see Figure 5). An Enterprise Data Hub is the foundation pattern
in data and new insight provisioning in a multi-platform analytical environment.
It would be possible to use MapR M5 or M7 as an Enterprise Data Hub that
cleans, transforms and integrates data from multiple structured and multi-
structured sources and provisions trusted data into any analytical platform in a
big data analytical ecosystem for subsequent analysis. This includes:
• MapR M5/M7 Hadoop distribution itself, where sandboxes are created
for data scientists to conduct exploratory analysis as part of a data
refining process
• Enterprise data warehouses
• Data marts
• Analytical appliances
• Other NoSQL databases e.g. graph databases
[Figure 5: Data hub – consume, clean, integrate, analyse and provision data from Hadoop (web logs, sensors, social, cloud, RDBMS, files, office docs) to any analytical platform, including the EDW, data marts, DW appliances and NoSQL databases such as graph DBMSs.]
FEEDING DATA WAREHOUSES FROM A HADOOP DATA HUB TO PRODUCE
NEW INSIGHT FROM ENRICHED DATA
Having transformed data on Hadoop and produced insights from it, there is a
need to feed those new insights into existing environments to enrich what is
already known. This means being able to also have ETL tools extract
derived insights produced on Hadoop from that platform and integrate them
with other structured data going into a data warehouse (see Figure 6). This
may happen on Hadoop itself (i.e. push data into a data warehouse) or outside
of Hadoop to pull the data into a data warehouse. In this way we can facilitate
multi-platform analytics that may start by analysing data on Hadoop and end up
offering new insights to self-service BI users accessing a data warehouse.
By embedding analytics in Hadoop ELT processing, it is also potentially
possible to turn ELT workflows into multi-platform analytical workflows.
[Figure 6: Leveraging Hadoop for data integration on massive volumes of data (hundreds of terabytes up to petabytes) to bring additional insights into a DW, e.g. deriving sentiment insight from huge volumes of social web content on sites like Twitter, Facebook, Digg, Myspace, TripAdvisor and LinkedIn.]
ARCHIVING DATA WAREHOUSE DATA INTO HADOOP
In order to maximise the value from ETL processing in a big data environment,
it must be possible to move data from Hadoop into other NoSQL and relational
analytical platforms and vice-versa. This includes orchestrating multi-platform
analytical ETL workflows to solve complex analytical problems.
[Figure 7: The need to manage the supply of consistent data and archive data across the entire analytical ecosystem using an enterprise information management tool suite.]
Figure 7 shows this capability. With two-way data movement it becomes
possible to take dimension data into Hadoop and archive data from data
warehouses into Hadoop. It also becomes possible to manage data across all
data stores and analytical platforms in Hadoop. What this also shows is that
data management software has to scale much more than before, not just to
handle big data volumes, but also to handle data movement across platforms
during analytical processing. It will also be used for data archiving across
platforms. Therefore ETL scalability on a robust, highly available self-healing
Hadoop platform like MapR is even more important going forward.
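As a hedged illustration of the archiving pattern, the HiveQL below stores warehouse history in a date-partitioned Hive table so that archived data can be selectively retrieved later; the table layout, export path and partition scheme are assumptions made for the sketch:

    -- Date-partitioned archive table for warehouse history
    CREATE TABLE sales_archive (
        order_id     BIGINT,
        customer_key BIGINT,
        amount       DOUBLE
    )
    PARTITIONED BY (order_month STRING);

    -- Load one month of data exported from the warehouse into a partition
    LOAD DATA INPATH '/archive/exports/2009-06'
    INTO TABLE sales_archive PARTITION (order_month = '2009-06');

    -- Selective retrieval reads only the partitions named in the predicate
    SELECT customer_key, SUM(amount) AS total_spend
    FROM   sales_archive
    WHERE  order_month BETWEEN '2009-01' AND '2009-12'
    GROUP  BY customer_key;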
CONCLUSION
The emergence of new data sources and the need to analyse everything from
unstructured data to live event streams has led many organisations to realise
that the spectrum of analytical workloads is now so broad that they cannot all
be dealt with in a single enterprise data warehouse. Companies now need
multiple analytical platforms in addition to traditional data warehouses and data
marts to manage big data workloads. New big data platforms like Hadoop,
stream processing engines, and NoSQL graph DBMSs are all emerging as
platforms optimised for specific analytical workloads that need to be added to
the enterprise analytical setup.
This has resulted in a more complex analytical environment that has put much
more emphasis on data management to keep data consistent across big data
workload-optimised analytical platforms and traditional data warehouses.
ETL software now has to deal with multiple data types, very large data volumes
and high velocity event streams as well as handling traditional ETL processing
into data warehouses. In addition, this software must now deal with the need to
rapidly move data between big data and traditional data warehousing platforms
during the execution of analytical workloads. All this is needed while continuing
to deliver value for money and without causing dramatic increases in cost. Data
refinery processes have to be fast, efficient, simple to use and cost-effective.
Several data management vendors now support Hadoop as both a source and
a target in their ETL tools. They also generate HiveQL, Pig Latin or Java to
create MapReduce ELT processing jobs that fully exploit massively parallel
Hadoop clusters. To support faster filtering and consumption of data, we are
also seeing ETL tools starting to support the embedding of analytics into ETL
workflows so that fully automated analytical workflows can be built to speed up
the rate at which organisations can consume, analyse and act on data.
It is this combination of Hadoop with data management software and in-
Hadoop analytics that opens up the attractive proposition of creating a low cost
Enterprise Data Hub (as shown in Figure 4) that manages and accelerates the
data refinery process in an end-to-end big data analytical ecosystem. The
MapR Distribution for Hadoop is well suited to this role and can also support
the offloading of a subset of analytical processing from data warehouses. The
Enterprise Data Hub is not just for data warehousing however. Its job is to
become the foundation for cleansing, transforming and integrating structured
and multi-structured data from multiple sources before provisioning filtered data
and new insights to any platform in the entire big data analytical ecosystem for
subsequent analysis.
MapR, with its enterprise-grade Hadoop distribution and its partners, looks
ready for the challenge.
About Intelligent Business Strategies
Intelligent Business Strategies is a research and consulting company whose
goal is to help companies understand and exploit new developments in
business intelligence, analytical processing, data management and enterprise
business integration. Together, these technologies help an organisation
become an intelligent business.
Author
Mike Ferguson is Managing Director of Intelligent Business Strategies Limited.
As an analyst and consultant he specialises in business intelligence and
enterprise business integration. With over 31 years of IT experience, Mike has
consulted for dozens of companies on business intelligence strategy, big data,
data governance, master data management, enterprise architecture, and SOA.
He has spoken at events all over the world and has written numerous articles
and blogs providing insights on the industry.
Formerly he was a principal and co-founder of Codd and Date Europe Limited
– the inventors of the Relational Model, a Chief Architect at Teradata on the
Teradata DBMS and European Managing Director of Database Associates, an
independent analyst organisation. He teaches popular master classes in Big
Data Analytics, New Technologies for Business Intelligence and Data
Warehousing, Enterprise Data Governance, Master Data Management, and
Enterprise Business Integration.
INTELLIGENT BUSINESS STRATEGIES
Water Lane, Wilmslow
Cheshire, SK9 5BG
England
Telephone: (+44)1625 520700
Internet URL: www.intelligentbusiness.biz
E-mail: info@intelligentbusiness.biz
Copyright © 2014 by Intelligent Business Strategies
All rights reserved
Narender Mendiratta resumeNarender Mendiratta resume
Narender Mendiratta resume
 
EPV_PCI DSS White Paper (3) Cyber Ark
EPV_PCI DSS White Paper (3) Cyber ArkEPV_PCI DSS White Paper (3) Cyber Ark
EPV_PCI DSS White Paper (3) Cyber Ark
 
Cisco Big Data Use Case
Cisco Big Data Use CaseCisco Big Data Use Case
Cisco Big Data Use Case
 
Narender Mendiratta resume
Narender Mendiratta resumeNarender Mendiratta resume
Narender Mendiratta resume
 

Similar to MapR Data Hub White Paper V2 2014

Building a Big Data Analytics Platform- Impetus White Paper
Building a Big Data Analytics Platform- Impetus White PaperBuilding a Big Data Analytics Platform- Impetus White Paper
Building a Big Data Analytics Platform- Impetus White PaperImpetus Technologies
 
Data warehouse-optimization-with-hadoop-informatica-cloudera
Data warehouse-optimization-with-hadoop-informatica-clouderaData warehouse-optimization-with-hadoop-informatica-cloudera
Data warehouse-optimization-with-hadoop-informatica-clouderaJyrki Määttä
 
Big Data Tools: A Deep Dive into Essential Tools
Big Data Tools: A Deep Dive into Essential ToolsBig Data Tools: A Deep Dive into Essential Tools
Big Data Tools: A Deep Dive into Essential ToolsFredReynolds2
 
Big data an elephant business opportunities
Big data an elephant   business opportunitiesBig data an elephant   business opportunities
Big data an elephant business opportunitiesBigdata Meetup Kochi
 
TDWI checklist - Evolving to Modern DW
TDWI checklist - Evolving to Modern DWTDWI checklist - Evolving to Modern DW
TDWI checklist - Evolving to Modern DWJeannette Browning
 
Appfluent and Cloudera Solution Brief
Appfluent and Cloudera Solution BriefAppfluent and Cloudera Solution Brief
Appfluent and Cloudera Solution BriefAppfluent Technology
 
intelligent-data-lake_executive-brief
intelligent-data-lake_executive-briefintelligent-data-lake_executive-brief
intelligent-data-lake_executive-briefLindy-Anne Botha
 
Hadoop data-lake-white-paper
Hadoop data-lake-white-paperHadoop data-lake-white-paper
Hadoop data-lake-white-paperSupratim Ray
 
The New Enterprise Blueprint featuring the Gartner Magic Quadrant
The New Enterprise Blueprint featuring the Gartner Magic QuadrantThe New Enterprise Blueprint featuring the Gartner Magic Quadrant
The New Enterprise Blueprint featuring the Gartner Magic QuadrantLindaWatson19
 
Modern Integrated Data Environment - Whitepaper | Qubole
Modern Integrated Data Environment - Whitepaper | QuboleModern Integrated Data Environment - Whitepaper | Qubole
Modern Integrated Data Environment - Whitepaper | QuboleVasu S
 
Webinar - Accelerating Hadoop Success with Rapid Data Integration for the Mod...
Webinar - Accelerating Hadoop Success with Rapid Data Integration for the Mod...Webinar - Accelerating Hadoop Success with Rapid Data Integration for the Mod...
Webinar - Accelerating Hadoop Success with Rapid Data Integration for the Mod...Hortonworks
 
Evolving Big Data Strategies: Bringing Data Lake and Data Mesh Vision to Life
Evolving Big Data Strategies: Bringing Data Lake and Data Mesh Vision to LifeEvolving Big Data Strategies: Bringing Data Lake and Data Mesh Vision to Life
Evolving Big Data Strategies: Bringing Data Lake and Data Mesh Vision to LifeSG Analytics
 
Big Data Expo 2015 - Hortonworks Common Hadoop Use Cases
Big Data Expo 2015 - Hortonworks Common Hadoop Use CasesBig Data Expo 2015 - Hortonworks Common Hadoop Use Cases
Big Data Expo 2015 - Hortonworks Common Hadoop Use CasesBigDataExpo
 
Summary introduction to data engineering
Summary introduction to data engineeringSummary introduction to data engineering
Summary introduction to data engineeringNovita Sari
 
Big Data Solutions on Cloud – The Way Forward by Kiththi Perera SLT
Big Data Solutions on Cloud – The Way Forward by Kiththi Perera SLTBig Data Solutions on Cloud – The Way Forward by Kiththi Perera SLT
Big Data Solutions on Cloud – The Way Forward by Kiththi Perera SLTKiththi Perera
 
Big data solutions on cloud – the way forward
Big data solutions on cloud – the way forwardBig data solutions on cloud – the way forward
Big data solutions on cloud – the way forwardKiththi Perera
 
Modern data warehouse
Modern data warehouseModern data warehouse
Modern data warehouseStephen Alex
 
Modern data warehouse
Modern data warehouseModern data warehouse
Modern data warehouseStephen Alex
 
Hd insight overview
Hd insight overviewHd insight overview
Hd insight overviewvhrocca
 

Similar to MapR Data Hub White Paper V2 2014 (20)

Building a Big Data Analytics Platform- Impetus White Paper
Building a Big Data Analytics Platform- Impetus White PaperBuilding a Big Data Analytics Platform- Impetus White Paper
Building a Big Data Analytics Platform- Impetus White Paper
 
Data warehouse-optimization-with-hadoop-informatica-cloudera
Data warehouse-optimization-with-hadoop-informatica-clouderaData warehouse-optimization-with-hadoop-informatica-cloudera
Data warehouse-optimization-with-hadoop-informatica-cloudera
 
Big Data Tools: A Deep Dive into Essential Tools
Big Data Tools: A Deep Dive into Essential ToolsBig Data Tools: A Deep Dive into Essential Tools
Big Data Tools: A Deep Dive into Essential Tools
 
Big data an elephant business opportunities
Big data an elephant   business opportunitiesBig data an elephant   business opportunities
Big data an elephant business opportunities
 
TDWI checklist - Evolving to Modern DW
TDWI checklist - Evolving to Modern DWTDWI checklist - Evolving to Modern DW
TDWI checklist - Evolving to Modern DW
 
Appfluent and Cloudera Solution Brief
Appfluent and Cloudera Solution BriefAppfluent and Cloudera Solution Brief
Appfluent and Cloudera Solution Brief
 
intelligent-data-lake_executive-brief
intelligent-data-lake_executive-briefintelligent-data-lake_executive-brief
intelligent-data-lake_executive-brief
 
Hadoop data-lake-white-paper
Hadoop data-lake-white-paperHadoop data-lake-white-paper
Hadoop data-lake-white-paper
 
The New Enterprise Blueprint featuring the Gartner Magic Quadrant
The New Enterprise Blueprint featuring the Gartner Magic QuadrantThe New Enterprise Blueprint featuring the Gartner Magic Quadrant
The New Enterprise Blueprint featuring the Gartner Magic Quadrant
 
Modern Integrated Data Environment - Whitepaper | Qubole
Modern Integrated Data Environment - Whitepaper | QuboleModern Integrated Data Environment - Whitepaper | Qubole
Modern Integrated Data Environment - Whitepaper | Qubole
 
Webinar - Accelerating Hadoop Success with Rapid Data Integration for the Mod...
Webinar - Accelerating Hadoop Success with Rapid Data Integration for the Mod...Webinar - Accelerating Hadoop Success with Rapid Data Integration for the Mod...
Webinar - Accelerating Hadoop Success with Rapid Data Integration for the Mod...
 
Issue in Data warehousing and OLAP in E-business
Issue in Data warehousing and OLAP in E-businessIssue in Data warehousing and OLAP in E-business
Issue in Data warehousing and OLAP in E-business
 
Evolving Big Data Strategies: Bringing Data Lake and Data Mesh Vision to Life
Evolving Big Data Strategies: Bringing Data Lake and Data Mesh Vision to LifeEvolving Big Data Strategies: Bringing Data Lake and Data Mesh Vision to Life
Evolving Big Data Strategies: Bringing Data Lake and Data Mesh Vision to Life
 
Big Data Expo 2015 - Hortonworks Common Hadoop Use Cases
Big Data Expo 2015 - Hortonworks Common Hadoop Use CasesBig Data Expo 2015 - Hortonworks Common Hadoop Use Cases
Big Data Expo 2015 - Hortonworks Common Hadoop Use Cases
 
Summary introduction to data engineering
Summary introduction to data engineeringSummary introduction to data engineering
Summary introduction to data engineering
 
Big Data Solutions on Cloud – The Way Forward by Kiththi Perera SLT
Big Data Solutions on Cloud – The Way Forward by Kiththi Perera SLTBig Data Solutions on Cloud – The Way Forward by Kiththi Perera SLT
Big Data Solutions on Cloud – The Way Forward by Kiththi Perera SLT
 
Big data solutions on cloud – the way forward
Big data solutions on cloud – the way forwardBig data solutions on cloud – the way forward
Big data solutions on cloud – the way forward
 
Modern data warehouse
Modern data warehouseModern data warehouse
Modern data warehouse
 
Modern data warehouse
Modern data warehouseModern data warehouse
Modern data warehouse
 
Hd insight overview
Hd insight overviewHd insight overview
Hd insight overview
 

MapR Data Hub White Paper V2 2014

transaction volumes. Transaction data is therefore rapidly on the increase, and clickstream data from online browsing is reaching unprecedented volumes. That clickstream data also harbours deep insight into online customer behaviour.

For most companies, the way they analyse sales and other transaction activity is by extracting this data from e-commerce systems; cleaning, transforming and integrating it with customer, product and financial data from other core transaction processing systems; loading it into a data warehouse; and then analysing subsets of it in data marts using business intelligence tools. Over the years, as transaction data has grown and other data sources have become available for analysis, data extraction, transformation and loading (ETL) has become increasingly difficult to scale. ETL tools switched to ELT, extracting and loading data into data warehouse staging tables first and then using the power of parallel SQL processing in a massively parallel database to achieve scalability.

Today, however, the momentum behind online channels as the preferred way of transacting business and interacting with companies has become so great that data volumes are increasing at rates we have not seen before. Clickstream data, inbound email interactions, social media interactions and sensor data are taking data volumes to new heights. The result is that staging areas set aside in data warehouses for ETL processing are becoming so large that ETL processing on its own is driving expensive data warehouse upgrades to handle the workload. In addition, analysis of complex data types such as text, JSON, clickstream, images and video is increasingly happening on Hadoop.

This paper looks at an alternative solution: the creation of an Enterprise Data Hub, with a data landing zone and a data refinery, on a much lower cost Hadoop platform that can scale to manage increasing data volumes and integrate structured master and transaction data with more complex, high value data such as clickstream and multi-structured interaction data. In addition, we look at how Hadoop can be used as an analytical platform to support exploratory analysis of raw data within the data refinery, producing new insights that can be published and offered to business analysts for use in further analyses. From here, business analysts throughout the enterprise can subscribe to receive new insights into traditional data warehouses and data marts, enriching what companies already know with the intent of delivering competitive advantage in existing and new markets. We will also look at how Hadoop can act as a long-term data store for big data, as well as an online archive for data warehouse data that is no longer analysed on a frequent basis.
MapR is a Hadoop vendor that has enhanced its MapR M5 and M7 Editions with high availability features such as JobTracker HA™ and No NameNode HA™, MapR Direct Access NFS™, snapshots for online point-in-time data recovery, automatic data compression, remote mirroring, disaster recovery and data protection. These disaster recovery and data protection features make M5 and M7 capable of becoming a long-term, low cost data store where new big data sources can be analysed and where archived data from traditional data warehouses can be stored and selectively reprocessed.

In addition, M5 and M7 offer workload management support, allowing a Hadoop cluster to be logically divided to support different use cases, job types, user groups and administrators, and allowing jobs to be isolated. All of this helps support multiple workloads and allows usage to be managed and tracked. These capabilities make MapR an enterprise-grade Hadoop platform capable of supporting an Enterprise Data Hub encompassing a data landing zone and a data refinery, where data can be cleaned, integrated and analysed by data scientists to produce new insights for competitive advantage. These new insights can then be supplied to data warehouses, data marts and other analytical platforms, forming the data foundation of a multi-platform analytical ecosystem.
INTRODUCTION - DATA WAREHOUSING AND THE ORIGINS OF ETL PROCESSING

For many years, companies have been building data warehouses to analyse business activity and produce insights that decision makers can act on to improve business performance. These traditional analytical systems are often based on a classic pattern in which data from multiple transaction processing systems is captured, cleaned, transformed and integrated before being loaded into a data warehouse.

Initially, capturing, cleaning and integrating data was the job of IT programmers who wrote hand-crafted code to extract, transform and load (ETL) data from multiple sources into newly designed data warehouse databases for subsequent analysis and reporting. Soon, however, ETL software tools emerged to take on this task and improve productivity. Some of these tools generated 3GL and 4GL code to do the work, while others interpreted graphically defined rules at run time. ETL execution involved extracting data from multiple operational systems, moving the data to the ETL server, and transforming and integrating it on that server before loading it into a target data warehouse.

In the early years, as customer demand grew, vendors added support for more and more structured data sources, including popular packaged transaction processing applications, new file formats and popular external data providers. However, more data sources meant larger data volumes, and many customers began hitting performance limitations, especially when data was being totally refreshed. ETL tool vendors responded by adding support for change data capture but, even so, the problem of ETL performance emerged again as business demand for data increased.

SCALING UP DATA INTEGRATION – THE SHIFT FROM ETL TO ELT

To counter this problem, many ETL vendors began to look at new ways of achieving scalability. One of the most popular approaches was to exploit parallel query processing in massively parallel (MPP) relational DBMSs. Rather than just loading transformed data from an ETL server into target MPP RDBMSs, several ETL vendors realised that they could boost performance by capturing data from multiple data sources, loading it into staging tables on a target MPP RDBMS and then generating SQL to transform the data using the massively parallel query processing of the DBMS. The result was a significant performance improvement that also made it possible for transformed, integrated data to be moved from staging tables into production as a "within the box" process on the same RDBMS platform. This approach gave rise to the term Extract, Load, Transform (ELT), whereby MPP RDBMSs took data integration scalability to a new level.
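To make the ELT pattern concrete, the sketch below shows the two steps in miniature: raw extracts are bulk-loaded into a staging table, and the transformation is then pushed down to the database as SQL so that it runs in parallel inside the MPP engine. This is an illustrative sketch only; the connection details, file, tables and columns are hypothetical, and in practice an ETL tool would generate this SQL rather than it being hand-coded.

```python
# A minimal sketch of the ELT pattern, using a PostgreSQL-compatible DB-API
# driver to stand in for an MPP RDBMS. All names here are hypothetical.
import psycopg2

conn = psycopg2.connect(host="dw-host", dbname="edw", user="etl")
cur = conn.cursor()

# Step 1 (Extract + Load): bulk-load the raw source extract into a staging table.
with open("orders_extract.csv") as f:
    cur.copy_expert("COPY staging.orders FROM STDIN WITH CSV", f)

# Step 2 (Transform): push the transformation down to the database as SQL,
# so cleansing and integration run in parallel inside the warehouse engine.
cur.execute("""
    INSERT INTO dw.fact_orders (order_id, customer_key, order_date, amount)
    SELECT o.order_id, c.customer_key, o.order_date::date, o.amount::numeric
    FROM   staging.orders o
    JOIN   dw.dim_customer c ON c.source_customer_id = o.customer_id
    WHERE  o.amount IS NOT NULL
""")
conn.commit()
```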
THE EMERGENCE OF BIG DATA AND MULTIPLE ANALYTICAL WORKLOADS

Although this traditional environment is now mature, many new and more complex types of data have emerged that businesses want to analyse to enrich what they already know. In addition, the rate at which much of this new data is created or generated is far beyond anything seen before. Customers and prospects are creating huge amounts of new data on social networks and review web sites. Online news items, weather data, competitor web site content and even data marketplaces are now available as candidate data sources for business consumption. Within the enterprise, web logs are growing at staggering rates as customers switch to online channels as their preferred way to transact business and interact with companies. Increasing numbers of sensor networks and machines are also being deployed to instrument and optimise business operations. The result is an abundance of new "big data" sources, rapidly increasing data volumes and a flurry of new data streams that all need to be analysed.

CHARACTERISTICS OF MULTI-STRUCTURED AND STRUCTURED BIG DATA

The characteristics of these new data sources differ from those of the structured data that has been analysed in data warehouses for the last twenty years. For example, the variety of data types being captured now includes:

• Structured data
• Semi-structured data, e.g. XML, HTML
• Unstructured data, e.g. text, audio, video
• Machine-generated data, e.g. sensor data

Semi-structured data such as XML allows navigation of document paths to go deeper into the content and derive business value. Unstructured text requires text mining to parse the data and derive structured data from it, often while also building full-text indexes. Deriving insight from unstructured audio and video data is more challenging, but even here demand is growing, especially from government agencies and law enforcement.

In addition to data variety, data volumes are also increasing. Unstructured and machine-generated data, in particular, can be very large in volume. However, volumes of structured transaction data are also increasing rapidly, mainly because of the growth in the use of online channels from desktop computers and mobile devices. One side effect of much larger transaction volumes is that the staging tables on data warehouses that hold data awaiting ELT processing are growing rapidly, which in turn forces companies to upgrade data warehouse platforms, often at considerable cost, to hold more data.

Finally, the rate (velocity) at which data is being generated is also increasing. Clickstream data, sensor data and financial markets data are good examples of this and are sometimes referred to as data streams.
BIG DATA ANALYTICAL WORKLOADS

The arrival of big data and big data analytics has taken us beyond the traditional analytical workloads seen in data warehouses. Examples of new analytical workloads include:

• Analysis of data in motion
• Complex analysis of structured data
• Exploratory analysis of un-modelled multi-structured data
• Graph analysis, e.g. social networks
• Accelerating ETL processing of structured and multi-structured data to enrich data in a data warehouse or analytical appliance
• The long-term storage and reprocessing of archived data warehouse data for rapid selective retrieval

These new analytical workloads are more likely to be processed outside of traditional data warehouses and data marts, on platforms better suited to these kinds of workloads.

HADOOP – A KEY PLATFORM FOR BIG DATA ANALYTICS

One key platform that has emerged to support big data analytical workloads is Apache Hadoop. The Hadoop software "stack" has a number of components, including:

• Hadoop HDFS: A distributed file system that partitions large files across multiple machines for high-throughput access to data
• Hadoop YARN: A framework for job scheduling and cluster resource management
• Hadoop MapReduce: A programming framework for distributed batch processing of large data sets distributed across multiple servers
• Hive: A data warehouse system for Hadoop that facilitates data summarisation, ad hoc queries and the analysis of large datasets stored in Hadoop-compatible file systems. Hive provides a mechanism to project structure onto this data and query it using a SQL-like language called HiveQL. HiveQL programs are converted into MapReduce programs
• HBase: An open source, distributed, versioned, column-oriented store modelled after Google's BigTable
• Pig: A high-level data-flow language for expressing MapReduce programs that analyse large HDFS-distributed data sets
• Mahout: A scalable machine learning and data mining library
• Oozie: A workflow/coordination system to manage Hadoop jobs
• Spark: A general purpose engine for large-scale in-memory data processing. It supports analytical applications that make use of stream processing, SQL access to columnar data and analytics on distributed in-memory data
• ZooKeeper: A coordination service for distributed applications
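To illustrate the MapReduce programming model listed above, the sketch below shows a map function and a reduce function written as a single Python script for use with Hadoop Streaming, which runs ordinary stdin/stdout scripts in parallel across the cluster. The clickstream record layout is hypothetical; the framework shuffles and sorts the mapper's output by key before the reducer sees it.

```python
#!/usr/bin/env python
# Minimal Hadoop Streaming sketch: count page views per URL in clickstream
# logs. Assumes tab-separated records with the page in the second field.
import sys

def mapper():
    # Emit (page, 1) for every record; Hadoop shuffles and sorts these
    # pairs by key before they reach the reducer.
    for line in sys.stdin:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2:
            print("%s\t1" % fields[1])

def reducer():
    # Input arrives sorted by key, so counts for one page are contiguous.
    current_page, count = None, 0
    for line in sys.stdin:
        page, value = line.rstrip("\n").split("\t")
        if page != current_page:
            if current_page is not None:
                print("%s\t%d" % (current_page, count))
            current_page, count = page, 0
        count += int(value)
    if current_page is not None:
        print("%s\t%d" % (current_page, count))

if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()
```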
BUILDING AN ENTERPRISE DATA HUB USING MAPR

WHAT IS AN ENTERPRISE DATA HUB?

Having discussed the characteristics of new sources of data, the new analytical workloads that now need to be supported, and Hadoop as a key platform for analytics, a key question is "How does Hadoop fit into an existing analytical environment?" A key emerging role for Hadoop is that of an Enterprise Data Hub, as shown in Figure 1.

Figure 1: The managed Hadoop Enterprise Data Hub includes a data reservoir, a data refinery and a zone for new high value insights, published (on a publish/subscribe basis) to platforms such as an EDW, a DW appliance and a graph DBMS.

An enterprise data hub is a managed and governed Hadoop environment in which to land raw data, refine it and publish new insights that can be delivered to authorised users throughout the enterprise, either on demand or on a subscription basis. These users may want to add the new insights to existing data warehouses and data marts to enrich what they already know and/or to conduct further analyses for competitive advantage. The Enterprise Data Hub consists of:

• A managed data reservoir
• A governed data refinery
• Published, protected and secure high value insights
• Long-term storage of archived data from data warehouses
All of this is made available in a secure, well-governed environment. Within the enterprise data hub, a data reservoir is where raw data is landed, collected and organised before it enters the data refinery, where data and data relationships are discovered and data is parsed, profiled, cleansed, transformed and integrated. It is then made available to data scientists, who may combine it with other trusted data such as master data or historical data from a data warehouse before conducting exploratory analyses in a sandbox environment to identify and produce new business insights. These insights are the output of the data refining process. They are made available to other authorised users in the enterprise by first describing them using common vocabulary data definitions and then publishing them into a new insights zone, where they become available for distribution to other platforms and analytical projects. In addition, cold data that is no longer used frequently can be archived from data warehouses into the hub.

THE MAPR DISTRIBUTION FOR HADOOP AS AN ENTERPRISE DATA HUB PLATFORM

MapR is a vendor that provides a Hadoop platform on which to build a managed Enterprise Data Hub. MapR was founded in 2009, is based in San Jose, California, and offers three editions of its distribution for Apache Hadoop:

• MapR M3 Standard Edition: A free community edition that includes HBase™, Pig, Hive, Mahout, Cascading, Sqoop, Flume and more. It includes POSIX-compliant NFS file system access.
• MapR M5 Enterprise Edition: Includes HBase™, Pig, Hive, Mahout, Cascading, Sqoop, Flume, Impala, Spark and more. M5 is a no-single-point-of-failure edition with high availability and data protection features such as JobTracker HA, No NameNode HA, snapshots, and mirroring to synchronise data across clusters.
• MapR M7 Enterprise Database Edition: Includes all the capabilities of M5 plus enterprise-grade modifications to HBase that make it more dependable and faster.

MapR Disaster Recovery and Data Protection

MapR has also strengthened Hadoop by adding support for disaster recovery and data protection to its M5 and M7 distributions. For disaster recovery, MapR provides remote mirroring to keep a synchronised copy of data at a remote site, so that processing can continue uninterrupted in the case of a disaster. Multiple on-site or geographically dispersed clusters can be managed with the MapR Control System. With respect to data protection, MapR has no single points of failure, with No NameNode HA and distributed cluster metadata. MapR Snapshots provide point-in-time recovery, while MapR Mirroring offers business continuity. Everything in the MapR distribution is logged and able to restart, with the intent that the entire cluster is self-healing and self-tuning. The JobTracker and NameNode have been re-engineered to be distributed and replicated. Direct Access NFS HA means that clients do not idle waiting for unavailable servers, and rolling upgrades ensure that the cluster is always available. In addition, workload management is supported, including job isolation, job placement control, logical volumes, SLA enforcement and enterprise access control to isolate and secure data access.

Hadoop Workloads and MapR Extensions

Specific examples of workloads for which Hadoop is particularly well suited include:

• Offloading and accelerating data warehouse ELT processing at low cost
• Exploratory analysis of un-modelled multi-structured data
• Extreme analytics, for example running millions of scoring models concurrently on millions of accounts to detect "cramming" fraud on credit cards. This is an activity whereby fraudsters attempt to steal small amounts of money from large numbers of credit card accounts by associating false charges with vague financial services and hoping consumers simply don't notice. Running millions of analytical models concurrently on data is typically not a workload you would see in a data warehouse.
• The long-term storage of data and the reprocessing of archived data warehouse data for rapid selective retrieval

These are all workloads you would expect to find in an Enterprise Data Hub. The MapR enhancements to the underlying data platform that powers its Hadoop distribution provide the capabilities needed to support them, including continuous data capture, offloading of ELT processing, exploratory analytics, long-term storage of archived warehouse data, and selective retrieval of that data for analytical processing. Let us look at these in more detail, with particular focus on the data refining process, offloading ELT processing and offloading some analytical workloads from data warehouses.

THE DATA REFINERY - ACCELERATING ETL PROCESSING AT LOW COST

The evolution of ETL on big data platforms like Hadoop has mirrored its evolution on traditional data warehouses. At first, hand-crafted ETL programs were written to provision data into Hadoop and to transform and integrate it for exploratory analysis. The problem with this approach is that even if these programs exploit the multi-processor, multi-server Hadoop platform, development is slow and expensive, requiring scarce MapReduce programming skills. ETL tool vendors responded by announcing support for Hadoop, both as a target into which data could be provisioned for exploratory analysis and as a source from which derived insights could be moved into data warehouses. While this approach works, the ETL processing occurs outside the Hadoop environment and so cannot exploit the scalability of the Hadoop platform to deal with the characteristics of big data. To get that scalability, ETL vendors have evolved their products to exploit Hadoop by implementing ELT processing, much as they did on data warehouse systems. The difference now, however, is that all the data is loaded into a Hadoop cluster for ELT processing via generated 3GL, Hive or Pig ELT jobs
running natively on a low cost Hadoop cluster, as shown in Figures 2 and 3. It is this capability that is so attractive to the many companies looking for a way to offload ELT processing from data warehouses and create an enterprise data hub. Offloading ELT processing to Hadoop frees up considerable capacity on data warehouse platforms, thereby potentially avoiding expensive data warehouse upgrades. This is especially significant as transaction data volumes continue to grow and new big data sources become available for analysis.

Figure 2: Scaling ETL transformations by generating code for in-Hadoop ELT processing. Option 1: the ETL tool generates HiveQL, or converts generated SQL to HiveQL. Option 2: the ETL tool generates Pig Latin, whose compiler converts every transform into a MapReduce job. Option 3: the ETL tool generates 3GL MapReduce code.

Several ETL tool vendors have now re-written their transforms to run on Hadoop, and several have added new tools and transformations to handle large volumes of multi-structured data. What we are seeing in Hadoop environments is therefore a full repetition of what happened with ETL tools on MPP RDBMSs. This time, however, the attraction is that ELT processing can potentially be done at a much lower cost, given that a Hadoop cluster is a much cheaper platform on which to store any kind of data. It also opens the way for ELT processing to be offloaded from data warehouses and to exploit the full power of a Hadoop cluster, gaining the scalability needed to improve performance in a big data environment.

Figure 3: Provisioning data into Hadoop for exploratory analysis of multi-structured data using in-Hadoop ELT processing. Web logs, structured data, un-modelled multi-structured data and filtered sensor data flow through generated MapReduce ELT jobs into sandboxes to produce business insight.
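As a concrete illustration of option 1 in Figure 2, the sketch below submits a HiveQL transform that runs entirely inside the Hadoop cluster, where Hive compiles it into parallel MapReduce jobs. The PyHive client is used here only for illustration, and the host, tables and columns are hypothetical assumptions; in practice an ETL tool would generate and submit the HiveQL itself.

```python
# A minimal sketch of in-Hadoop ELT: the transformation logic is expressed in
# HiveQL and executed inside the cluster, where Hive turns it into parallel
# MapReduce jobs. Host, schema and column names are hypothetical.
from pyhive import hive

conn = hive.Connection(host="hadoop-edge-node", port=10000, username="etl")
cur = conn.cursor()

# Raw clickstream records were landed in a staging table; this transform
# cleanses and restructures them without the data ever leaving Hadoop.
cur.execute("""
    INSERT OVERWRITE TABLE refined.page_views
    SELECT  lower(trim(user_id)),
            parse_url(url, 'PATH')  AS page,
            to_date(event_time)     AS view_date,
            count(*)                AS views
    FROM    staging.raw_clickstream
    WHERE   user_id IS NOT NULL
    GROUP BY lower(trim(user_id)), parse_url(url, 'PATH'), to_date(event_time)
""")
```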
The aforementioned attractions of the Figure 3 pattern are leading many companies to consider placing a significant slice of their structured and multi-structured data into a Hadoop Enterprise Data Hub for ELT processing, before making subsets of it available for exploratory analysis on Hadoop itself or on other data warehouse and NoSQL platforms. The only challenge is the migration of existing ELT jobs running on existing data warehouses; this is helped by Hive being able to convert ELT-generated SQL (used to transform data) into MapReduce or potentially even in-memory Spark programs.

THE DATA REFINERY - EXPLORATORY ANALYSIS

In addition to ETL processing, another analytical workload that is very much part of the data refining process in a Hadoop Enterprise Data Hub is the exploratory analysis of complex data types. Here, data scientists often use freeform exploratory tools such as search, and/or develop and run batch MapReduce or Spark analytic applications (written in languages like Java, Python, Scala and R) to conduct exploratory analyses on un-modelled data stored in the Hadoop system. The purpose of this analysis within the data refining process is to derive structured insight from unstructured data, which may then be stored in HBase or Hive, or moved into a data warehouse for further analysis. With Hadoop MapReduce, these analytical programs are copied to thousands of compute nodes in a Hadoop cluster, where the data is located, in order to run the batch analysis in parallel. In addition, the in-Hadoop analytics in the Mahout library can run in parallel, close to the data, to exploit the full power of a Hadoop cluster. The addition of Spark means that MapR can improve the performance of exploratory analytical applications by exploiting in-memory processing. Data access can also be simplified by using SQL via Shark on Spark instead of lower level HDFS APIs. Insight derived from this exploratory analysis can then be published and moved into data warehouses to enrich their value, or into other analytical data stores for further analysis.

Accelerating Big Data Consumption and Filtering Using Automated Analytics During In-Hadoop ELT Processing

One of the challenges with big data is dealing with the data deluge. Companies have to face the fact that data is arriving faster than they can consume it. They therefore have to find a way to automate the consumption and refining of popular data, bringing data into the enterprise in a timely way for the business to analyse and act on. That means not only doing ETL processing on Hadoop, but also being able to analyse data during ELT processing as part of the data refining process. An example might be scoring Twitter sentiment during ELT processing of Twitter data on Hadoop, so that negative sentiment can be identified quickly and attributed to customers, products, brands or business functions such as customer service. Figure 4 takes in-Hadoop ELT processing further than Figure 1 by performing in-line automated analytics on Hadoop during ELT processing. In this way, popular structured and multi-structured data sources can be consumed and refined in a more automated way, expediting time to value.

In addition, if data scientists have built custom MapReduce or Spark based analytics for this kind of data, it is potentially possible to exploit those analytics during ELT processing. Once data scientists have built analytics that analyse data, those analytics can be used and re-used in analyses to produce insight from new big data sources, as sketched below.
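The sketch below illustrates the Twitter example above: a Spark job (Spark ships with the MapR M5 and M7 editions) scores sentiment in-line as tweet data is refined, so that strongly negative tweets can be routed onward without a separate analysis pass. The input path, JSON layout and word lexicon are hypothetical placeholders; a real refinery would apply a trained model built by data scientists rather than this toy lexicon.

```python
# A minimal sketch of automated in-line analytics during in-Hadoop ELT:
# score Twitter sentiment while the data is being refined. All paths and
# the lexicon are hypothetical.
import json
from pyspark import SparkContext

sc = SparkContext(appName="SentimentDuringELT")

NEGATIVE = {"bad", "slow", "broken", "refund", "angry"}
POSITIVE = {"great", "fast", "love", "easy", "helpful"}

def score(tweet):
    # Crude lexicon-based score: positive minus negative word hits.
    words = set(tweet.get("text", "").lower().split())
    return len(words & POSITIVE) - len(words & NEGATIVE)

# One JSON tweet per line in the landing zone.
tweets = sc.textFile("/landing/twitter/2014-05-01/*.json").map(json.loads)

# Attach a sentiment score in-line, then keep only negative tweets so they
# can be routed quickly to customer service for follow-up.
negative = (tweets
            .map(lambda t: (t.get("user", {}).get("id"), score(t)))
            .filter(lambda pair: pair[1] < 0))

negative.saveAsTextFile("/refinery/insights/negative_sentiment")
```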
Note how in Figure 4 the automated analysis allows the ELT workflow to span the entire data refining process.

Figure 4: The managed Hadoop Enterprise Data Hub includes automated analyses to refine data much more rapidly, with automated invocation of custom-built and pre-built analytics on Hadoop during ELT processing.

In this way it becomes possible to build an "Enterprise Data Filter" that speeds up the consumption of new and existing data sources and expedites the production of new high value business insights.

KEY MAPR FEATURES THAT MEET ENTERPRISE DATA HUB REQUIREMENTS

Given what is potentially possible, the next question is "How has MapR enhanced its Hadoop distributions to support the operation of an Enterprise Data Hub?" Since its inception, MapR has sought to enhance a critical part of the open source Apache Hadoop stack to improve availability, open up access and improve overall performance and usability. MapR has strengthened Apache Hadoop considerably to improve its resilience, improve performance and make it easier to manage. For example, MapR has removed multiple single points of failure in Apache Hadoop and introduced data mirroring across clusters, using asynchronous replication, to support failover and disaster recovery. In addition, it has added data snapshots and a heat map management console, and has improved performance through data compression and by rewriting the intermediate shuffle phase that occurs after Map and before Reduce. HBase has also been strengthened (in the MapR M7 Edition) to remove compactions, and Spark has been added to facilitate high performance in-memory analytics. All of this makes the MapR distributions much more enterprise-grade. Key features of MapR M5 and M7 that benefit ETL processing include:
• No single point of failure, with high availability features such as JobTracker High Availability and No NameNode High Availability. In a global business where ETL processing may need to happen several times a day, high availability is very important.
• MapR Direct Access™ NFS, which enables real-time read/write data flows via the industry-standard Network File System (NFS) protocol. With MapR Direct Access NFS, any remote client can simply mount the cluster. This means that application servers can write their log files and other data directly into the cluster, rather than writing them first to direct- or network-attached storage, reducing the need for log collection tools that may require agents on every application server (see the sketch after this list). Application servers can either write data directly into the cluster or use standard tools like rsync to synchronise data between local disks and the cluster. Either way, ELT processing on log data needed for clickstream analytics can potentially avoid the extract-and-load step into MapR M5/M7, speeding up the process of making this data available for analysis. It also reduces data latency, which is important in many applications.
• MapR Snapshots, which allow online point-in-time data recovery without replication of data. A volume snapshot is consistent (atomic), does not copy data, and does not impact performance. Snapshots can help ETL processing in the event of a failure, where an ETL job may need to be restarted from the point of failure, or at least from an intermediate snapshot taken at a specific point in the ETL processing.
• The rewritten intermediate shuffle phase that occurs after Map and before Reduce, which can significantly improve ETL performance for ETL tools generating MapReduce ETL jobs via Hive, Pig or natively in a 3GL language such as Java.
• Automatic data compression, which can also improve performance and so speed up data refinery processes.
• The MapR data protection and disaster recovery capabilities, which make the MapR distributions for Hadoop suitable for long-term storage of big data and of archived data warehouse data, which can then be selectively reprocessed in specific analyses even though it has been offloaded from the data warehouse.
• The MapR remote mirroring capability, which also allows ELT and analytical workloads in a data refinery to be spread across clusters in order to get more work done.
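As a small illustration of the Direct Access NFS point above, the sketch below appends clickstream records straight into the cluster through an NFS mount, using nothing but ordinary file I/O. The mount point, cluster name and record layout are hypothetical.

```python
# A minimal sketch of writing data directly into a MapR cluster over Direct
# Access NFS. Once the cluster is mounted (e.g. at /mapr/<cluster-name>), it
# behaves like an ordinary POSIX file system. All names are hypothetical.
import json
import time

LOG_PATH = "/mapr/prod-cluster/landing/weblogs/app01.log"

def log_click(user_id, url):
    record = {"ts": time.time(), "user": user_id, "url": url}
    # Appending via NFS means no separate collection agent or staging copy:
    # the record is immediately visible to ELT jobs running on the cluster.
    with open(LOG_PATH, "a") as f:
        f.write(json.dumps(record) + "\n")

log_click("u123", "/products/widgets")
```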
DATA HUB ELT PROCESSING WITH MAPR HADOOP DISTRIBUTIONS

With respect to ELT processing on Hadoop, MapR partners with a number of ETL tool and ETL accelerator vendors whose products can run ELT jobs on MapR M5 and M7 Hadoop clusters. These include:

• Informatica
• Pentaho
• Talend
• Syncsort

Using these partner technologies in combination with the MapR M5 and M7 editions, the following patterns are supported:

• Accelerating big data consumption and filtering by using in-Hadoop analytics during ELT processing
• In-Hadoop ELT processing via MapReduce-based transformations
• Provisioning data into Hadoop sandboxes for exploratory analysis as part of a data refining process
• Feeding data warehouses from Hadoop to accelerate multi-platform analytics

In terms of scalability, adding more Hadoop nodes to the cluster allows data to be processed at speed thanks to greater I/O parallelism and compute power. For very large amounts of data, a MapR Hadoop cluster can be spun up in a cloud environment such as Google Compute Engine to undertake this work at a much lower cost than configuring a cluster of similar size in-house.

With respect to restarting ELT processing, MapR snapshots taken at specific points in ELT processing make it possible to restart any data refinery ELT process quickly. MapR also provides random read/write access in its Hadoop distribution: in Apache Hadoop, HDFS is normally append-only, but one of the key features of the MapR distribution is Direct Access NFS. In the context of ELT, Direct Access NFS allows faster and more convenient loading of data into the Hadoop cluster, thereby reducing data latency. ETL tools that support change data capture can write changes straight into the MapR Hadoop cluster; Talend is an example of a MapR partner ETL tool vendor that can do this. Change data capture is very important to ETL performance, especially on large volumes of data.

Finally, in terms of ELT performance, the rewriting of the intermediate shuffle phase that occurs after Map and before Reduce benefits sorting, aggregation, hashing and pattern matching transformations, all of which are mainstream transformation functionality needed in most ELT jobs. This functionality, together with data compression, will boost performance. Together, these features allow MapR to support the key ETL patterns listed above.

HADOOP AS A DATA HUB FOR ALL ANALYTICAL PLATFORMS

Given these enhancements, the MapR distribution can be used not only to offload processing from data warehouses but also to create a low cost data hub (see Figure 5). An Enterprise Data Hub is the foundation pattern for data and new insight provisioning in a multi-platform analytical environment. It would be possible to use MapR M5 or M7 as an Enterprise Data Hub that cleans, transforms and integrates data from multiple structured and multi-structured sources and provisions trusted data into any analytical platform in a big data analytical ecosystem for subsequent analysis. This includes:

• The MapR M5/M7 Hadoop distribution itself, where sandboxes are created for data scientists to conduct exploratory analysis as part of a data refining process
• Enterprise data warehouses
• Data marts
• Analytical appliances
• Other NoSQL databases, e.g. graph databases

Figure 5: The data hub consumes, cleans, integrates and analyses data from sources such as feeds, sensors, RDBMSs, files, office documents, social media, cloud data, web logs and web services, and provisions it from Hadoop to any analytical platform, including the EDW, data marts, DW appliances and NoSQL databases such as graph databases.

FEEDING DATA WAREHOUSES FROM A HADOOP DATA HUB TO PRODUCE NEW INSIGHT FROM ENRICHED DATA

Having transformed data on Hadoop and produced insights from it, there is a need to add those new insights to existing environments to enrich what is already known. This means ETL tools must also be able to extract derived insights from the Hadoop platform and integrate them with other structured data going into a data warehouse (see Figure 6). This may happen on Hadoop itself (i.e. pushing data into a data warehouse) or outside of Hadoop (pulling the data into a data warehouse). In this way we can facilitate multi-platform analytics that may start by analysing data on Hadoop and end up offering new insights to self-service BI users accessing a data warehouse. By embedding analytics in Hadoop ELT processing, it is also potentially possible to turn ELT workflows into multi-platform analytical workflows.

Figure 6: Leveraging Hadoop for data integration on massive volumes of data to bring additional insights into a data warehouse, e.g. deriving sentiment insight from hundreds of terabytes up to petabytes of social web content on sites like Twitter, Facebook, Digg, Myspace, TripAdvisor and LinkedIn, and feeding the relevant insight into the EDW alongside data from operational systems.
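To make the "pull" variant concrete, the sketch below reads refined insight files from the data hub through the Direct Access NFS mount and inserts them into a warehouse table. This is illustrative only: the paths, table and columns are hypothetical, and in practice an ETL tool (or a transfer utility such as Sqoop, which the MapR editions bundle) would normally perform this step.

```python
# A minimal sketch of pulling derived insights from the Hadoop data hub into
# a data warehouse. Reads tab-separated part files via the NFS mount and
# loads them through a DB-API driver. All names are hypothetical.
import csv
import glob
import psycopg2

conn = psycopg2.connect(host="dw-host", dbname="edw", user="etl")
cur = conn.cursor()

rows = []
for part in glob.glob("/mapr/prod-cluster/refinery/insights/sentiment/part-*"):
    with open(part) as f:
        for user_id, score in csv.reader(f, delimiter="\t"):
            rows.append((user_id, int(score)))

# Add the new insight to what the warehouse already knows about each customer.
cur.executemany(
    "INSERT INTO dw.customer_sentiment (customer_id, sentiment_score) "
    "VALUES (%s, %s)", rows)
conn.commit()
```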
ARCHIVING DATA WAREHOUSE DATA INTO HADOOP

To maximise the value of ETL processing in a big data environment, it must be possible to move data from Hadoop into other NoSQL and relational analytical platforms, and vice versa. This includes orchestrating multi-platform analytical ETL workflows to solve complex analytical problems. Figure 7 shows this capability. With two-way data movement it becomes possible to take dimension data into Hadoop and to archive data from data warehouses into Hadoop. It also becomes possible to manage data across all data stores and analytical platforms in Hadoop.

Figure 7: The need to manage the supply of consistent data, and to archive data, across the entire analytical ecosystem: an enterprise information management tool suite coordinates stream processing, operational systems, an MDM system, the EDW, DW appliances, data marts and NoSQL databases such as graph databases.

What this also shows is that data management software has to scale much more than before, not just to handle big data volumes but also to handle data movement across platforms during analytical processing. It will also be used for data archiving across platforms. ETL scalability on a robust, highly available, self-healing Hadoop platform like MapR is therefore even more important going forward.
CONCLUSION

The emergence of new data sources, and the need to analyse everything from unstructured data to live event streams, has led many organisations to realise that the spectrum of analytical workloads is now so broad that they cannot all be dealt with in a single enterprise data warehouse. Companies now need multiple analytical platforms, in addition to traditional data warehouses and data marts, to manage big data workloads. New big data platforms like Hadoop, stream processing engines and NoSQL graph DBMSs are all emerging as platforms optimised for specific analytical workloads that need to be added to the enterprise analytical environment. Business is now demanding more analytical power to analyse new sources of structured and multi-structured data.

This has resulted in a more complex analytical environment that puts much more emphasis on data management to keep data consistent across big data workload-optimised analytical platforms and traditional data warehouses. ETL software now has to deal with multiple data types, very large data volumes and high-velocity event streams, as well as traditional ETL processing into data warehouses. In addition, this software must now rapidly move data between big data platforms and traditional data warehousing platforms during the execution of analytical workloads. All of this is needed while continuing to deliver value for money and without dramatic increases in cost: data refinery processes have to be fast, efficient, simple to use and cost-effective. In short, ETL tools have to scale to support more data, more complex transformations and faster data loading, and support for Hadoop, with rapid data movement between Hadoop and data warehouses, is now essential.

Several data management vendors now support Hadoop as both a source and a target in their ETL tools. They also generate HiveQL, Pig or Java to create MapReduce ELT processing jobs that fully exploit massively parallel Hadoop clusters. To support faster filtering and consumption of data, ETL tools are also starting to support the embedding of analytics into ETL workflows, so that fully automated analytical workflows can be built to speed up the rate at which organisations can consume, analyse and act on data.

It is this combination of Hadoop with data management software and in-Hadoop analytics that opens up the attractive proposition of creating a low-cost Enterprise Data Hub (as shown in Figure 4) that manages and accelerates the data refinery process in an end-to-end big data analytical ecosystem. The MapR Distribution for Hadoop is well suited to this role and can also support the offloading of a subset of analytical processing from data warehouses. The Enterprise Data Hub is not just for data warehousing, however. Its job is to become the foundation for cleansing, transforming and integrating structured and multi-structured data from multiple sources before provisioning filtered data and new insights to any platform in the entire big data analytical ecosystem for subsequent analysis. MapR, with its enterprise-grade Hadoop distribution and its partners, looks ready for the challenge and can help customers take ETL processing to the next level.
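To illustrate the shape of such generated ELT processing, here is a minimal, hypothetical sketch of a two-step refinery workflow: an in-Hadoop ELT step expressed in HiveQL, followed by provisioning of the refined result to a warehouse via Apache Sqoop. All table, path and connection names are assumptions for illustration only; a real deployment would use an ETL tool or a workflow engine to orchestrate equivalent jobs rather than a shell script.

```bash
#!/bin/sh
set -e
DT=$(date +%F)

# 1. ELT inside Hadoop: aggregate raw multi-structured data into a refined,
#    warehouse-ready table (Hive compiles this into parallel MapReduce jobs).
hive -e "
  INSERT OVERWRITE TABLE refined.daily_brand_sentiment
  SELECT brand, avg(sentiment_score), count(*)
  FROM raw.social_posts
  WHERE post_date = '${DT}'
  GROUP BY brand;
"

# 2. Provision: export only the refined insight, not the raw data, to the
#    warehouse where self-service BI users can query it. '\001' is Hive's
#    default field delimiter; --password-file avoids an interactive prompt.
sqoop export \
  --connect jdbc:mysql://dwhost/warehouse \
  --username etl_user \
  --password-file /user/etl/.dwpass \
  --table DAILY_BRAND_SENTIMENT \
  --export-dir /user/hive/warehouse/refined.db/daily_brand_sentiment \
  --input-fields-terminated-by '\001' \
  -m 4
```

The design point the sketch makes is the one argued throughout this paper: the heavy transformation runs where the data lives, on the Hadoop cluster, and only the small, refined result moves to the warehouse.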
About Intelligent Business Strategies

Intelligent Business Strategies is a research and consulting company whose goal is to help companies understand and exploit new developments in business intelligence, analytical processing, data management and enterprise business integration. Together, these technologies help an organisation become an intelligent business.

Author

Mike Ferguson is Managing Director of Intelligent Business Strategies Limited. As an analyst and consultant he specialises in business intelligence and enterprise business integration. With over 31 years of IT experience, Mike has consulted for dozens of companies on business intelligence strategy, big data, data governance, master data management, enterprise architecture and SOA. He has spoken at events all over the world and has written numerous articles and blogs providing insights on the industry. Formerly he was a principal and co-founder of Codd and Date Europe Limited (the inventors of the Relational Model), a Chief Architect at Teradata on the Teradata DBMS, and European Managing Director of Database Associates, an independent analyst organisation. He teaches popular master classes in Big Data Analytics, New Technologies for Business Intelligence and Data Warehousing, Enterprise Data Governance, Master Data Management, and Enterprise Business Integration.

INTELLIGENT BUSINESS STRATEGIES
Water Lane, Wilmslow
Cheshire, SK9 5BG
England
Telephone: (+44)1625 520700
Internet URL: www.intelligentbusiness.biz
E-mail: info@intelligentbusiness.biz

The Hadoop Data Refinery and Enterprise Data Hub
Copyright © 2014 by Intelligent Business Strategies
All rights reserved