The Hadoop Data Refinery and
Enterprise Data Hub
Prepared for:
By Mike Ferguson
Intelligent Business Strategies
May 2014
WHITEPAPER INTELLIGENT
BUSINESS
STRATEGIES
2. The Hadoop Data Refinery and Enterprise Data Hub
Copyright © Intelligent Business Strategies Limited, 2014, All Rights Reserved 2
Table of Contents

Management Summary .......................................................................... 3
Introduction - Data Warehousing and the Origins of ETL Processing ........ 5
Scaling Up Data Integration – The Shift from ETL to ELT ........................ 5
The Emergence of Big Data and Multiple Analytical Workloads ............... 6
Characteristics of Multi-structured and Structured Big Data ................. 6
Big Data Analytical Workloads .............................................................. 7
Hadoop – A Key Platform for Big Data Analytics .................................... 7
Building An Enterprise Data Hub Using MapR ........................................ 8
What Is An Enterprise Data Hub? .......................................................... 8
The MapR Hadoop Distribution as an Enterprise Data Hub Platform ....... 9
MapR Disaster Recovery and Data Protection ....................................... 9
Hadoop Workloads and MapR Extensions ........................................... 10
The Data Refinery - Accelerating ETL Processing at Low Cost ............. 10
The Data Refinery - Exploratory Analysis ............................................ 12
Accelerating Big Data Consumption and Filtering Using Automated Analytics During in-Hadoop ELT Processing .......... 12
Key MapR Features That Meet Enterprise Data Hub Requirements ...... 13
Data Hub ELT Processing With MapR Hadoop Distributions ................ 14
Hadoop as a Data Hub for All Analytical Platforms ............................... 15
Feeding Data Warehouses from a Hadoop Data Hub to Produce New Insight from Enriched Data .......... 16
Archiving Data Warehouse Data into Hadoop ...................................... 17
Conclusion ........................................................................................... 18
MANAGEMENT SUMMARY
Over recent years, many companies have seen huge growth in data volumes,
from both existing structured sources and new semi-structured and
unstructured data sources. The relentless rise in online shopping, together
with the convenience of mobile devices, is contributing significantly to
accelerating transaction volumes. Transaction data is therefore rapidly on the
increase, and clickstream data from online browsing is reaching
unprecedented volumes. That clickstream data also harbours deep insight
into online customer behaviour.
For most companies, the way they analyse sales and other transaction activity
is by extracting this data from e-commerce systems, cleaning, transforming
and integrating it with customer, product and financial data from other core
transaction processing systems, loading it into a data warehouse and then
analysing subsets of it in data marts using business intelligence tools. Over the
years as transaction data has grown and other data sources have become
available for analysis, the challenge of data extraction, transformation and
loading (ETL) has become increasingly difficult to scale. ETL tools have
switched to ELT to extract and load data into data warehouse staging tables
first before using the power of parallel SQL processing in a massively parallel
database to deal with scalability. However today, the momentum behind the
use of online channels as the preferred way of transacting business and
interacting with companies has become so great that data volumes are
increasing at rates we have not seen before. Clickstream data, inbound email
interactions, social media interactions and sensor data are taking data
volumes to new heights. The result is that staging areas in data warehouses
set aside for ETL processing are becoming so large that ETL processing on its
own is driving expensive upgrades to data warehouses to handle the workload.
In addition, analysis of complex data is also happening on Hadoop, covering
new data types such as text, JSON, clickstream, images and video.
This paper looks at an alternative solution: the creation of an Enterprise Data
Hub, with a data landing zone and data refinery, on a much lower-cost Hadoop
platform that can scale to manage increasing data volumes as well as integrate
structured master and transaction data with more complex, high-value data
such as clickstream and multi-structured interaction data. In addition we look at how
Hadoop can be used as an analytical platform to support exploratory analysis
of raw data within a data refinery in the Enterprise Data Hub to produce new
insights that can be published and offered up to business analysts for use in
further analyses. From here, business analysts throughout the enterprise can
subscribe to receive new insights into traditional data warehouses and data
marts so as to enrich what companies already know with the intent to deliver
competitive advantage in existing and new markets. We will also look at how
Hadoop can act as a long-term data store for big data as well as an on-line
archive for data warehouse data that is no longer analysed on a frequent basis.
MapR is a Hadoop vendor that has enhanced its MapR M5 Edition and MapR
M7 Edition to support high availability features such as JobTracker HA™ and
No NameNode HA™, MapR Direct Access NFS™, snapshots for online point-
in-time data recovery, automatic data compression, remote mirroring, disaster
recovery, and data protection. Its disaster recovery and data protection
features make M5 and M7 capable of becoming a long-term, low-cost data store
where new big data sources can be analysed and archived data from
traditional data warehouses can be stored and selectively reprocessed. In
addition, M5 and M7 offer workload management support, allowing a Hadoop
cluster to be logically divided to support different use cases, job types, user
groups and administrators. Jobs can also be isolated. All of this helps support
multiple workloads and allows usage to be managed and tracked.

The switch to online channels is driving unprecedented volumes of transaction data and clickstream data.

This is driving up the cost of data warehousing as staging areas holding data for ETL processing grow rapidly.

Companies are looking to lower the cost of data warehousing by archiving data and offloading processing.

Hadoop offers a complementary low-cost alternative that supports big data analytics and the ability to offload ETL processing.

MapR has created an enterprise-grade Hadoop platform that supports long-term data storage, data warehouse archive, offloading of ETL processing and big data analytics.
These capabilities make MapR an enterprise-grade Hadoop platform capable
of supporting an enterprise data hub encompassing a data landing zone and
data refinery where data can be cleaned, integrated and analysed by data
scientists to produce new insights for competitive advantage. These new
insights can then be supplied to data warehouses, data marts and other
analytic platforms forming the data foundation of a multi-platform analytical
ecosystem.
It can also act as an enterprise data hub to supply data to a multi-platform analytical ecosystem.
INTRODUCTION - DATA WAREHOUSING AND THE
ORIGINS OF ETL PROCESSING
For many years, companies have been building data warehouses to analyse
business activity and produce insights for decision makers to act on to improve
business performance. These traditional analytical systems are often based on
a classic pattern where data from multiple transaction processing systems is
captured, cleaned, transformed and integrated before loading it into a data
warehouse.
Initially, the challenge of capturing, cleaning and integrating data was the role
of IT programmers who wrote hand-crafted code to extract, transform and load
(ETL) data from multiple sources into newly designed data warehouse
databases for subsequent analysis and reporting. Soon however, new software
ETL tools emerged to take on this task and improve productivity. Some of
these tools generated 3GL and 4GL code to do the work while others
interpreted graphically defined rules at run time. ETL execution involved
extracting data from multiple operational systems, moving the data to the ETL
server, and transforming and integrating it on the server before loading it into a
target data warehouse. In the early years as customer demand grew, vendors
added support for more and more structured data sources including popular
packaged transaction processing applications, new file formats and popular
external data providers.
However, more data sources led to larger data volumes, causing many
customers to start hitting performance limitations, especially when data was
being totally refreshed. ETL tool vendors responded by adding support for
change data capture but, even so, the problem of ETL performance emerged
again as business demand for data increased.
SCALING UP DATA INTEGRATION – THE SHIFT FROM ETL TO ELT
To counter this problem, many ETL vendors began to look at new ways of
achieving scalability. One of the most popular ways adopted to do this was to
exploit parallel query processing in massively parallel (MPP) relational DBMSs.
Rather than just loading transformed data from an ETL server into target MPP
RDBMSs, several ETL vendors realised that they could boost performance by
capturing data from multiple data sources, loading it into staging tables on a
target MPP RDBMS and then generating SQL to transform the data using
massively parallel query processing in the DBMS. The result was significant
performance improvement that also made it possible for transformed,
integrated data to then be moved from staging tables into production as a
“within the box” process on the same RDBMS platform. This approach gave
rise to the term Extract, Load, Transform (ELT) whereby MPP RDBMSs took
data integration scalability to a new level.
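The ELT pattern described above can be illustrated with a minimal sketch: instead of transforming rows on the ETL server, the tool lands raw data in a staging table and issues one set-based SQL statement so the database engine does the transformation. This toy uses SQLite standing in for an MPP RDBMS; the table and column names are hypothetical, not from any vendor's product.

```python
import sqlite3

# In-memory database standing in for an MPP RDBMS.
con = sqlite3.connect(":memory:")
cur = con.cursor()

# 1. Extract + Load: land raw source rows in a staging table, untransformed.
cur.execute("CREATE TABLE stg_orders (order_id INT, amount_cents INT, country TEXT)")
raw_rows = [(1, 1099, "us"), (2, 2500, "gb"), (3, 999, "us")]
cur.executemany("INSERT INTO stg_orders VALUES (?, ?, ?)", raw_rows)

# 2. Transform: one set-based SQL statement -- the DBMS engine does the
#    work, rather than the ETL server transforming row by row.
cur.execute("CREATE TABLE dw_orders (order_id INT, amount REAL, country TEXT)")
cur.execute("""
    INSERT INTO dw_orders
    SELECT order_id,
           amount_cents / 100.0,      -- cents -> currency units
           UPPER(country)             -- standardise country codes
    FROM stg_orders
""")
con.commit()

print(cur.execute("SELECT * FROM dw_orders ORDER BY order_id").fetchall())
# -> [(1, 10.99, 'US'), (2, 25.0, 'GB'), (3, 9.99, 'US')]
```

On a real MPP platform the transform statement is executed in parallel across all nodes, which is where the scalability gain comes from.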
Extract, Transform and Load (ETL) tools emerged in the early years of data warehousing to extract, clean, transform and integrate data from multiple transaction processing systems into data warehouses.

ETL tools, while successful, experienced performance problems as the demand for data grew.

ETL tool vendors switched to loading data into MPP RDBMS staging tables first and then used SQL to transform it in parallel. This became known as ELT processing.
THE EMERGENCE OF BIG DATA AND MULTIPLE
ANALYTICAL WORKLOADS
Although this traditional environment is now mature, many new more complex
types of data have now emerged that businesses want to analyse to enrich
what they already know. In addition, the rate at which much of this new data is
being created and/or generated is far beyond what we have ever seen before.
Customers and prospects are creating huge amounts of new data on social
networks and review web sites. In addition, online news items, weather data,
competitor web site content, and even data marketplaces are now available as
candidate data sources for business consumption.
Within the enterprise, web logs are growing at staggering rates as customers
switch to online channels as their preferred way to transact business and
interact with companies. Also, increasing numbers of sensors and machines
are being deployed to instrument and optimise business operations.
The result is an abundance of new “big data” sources, rapidly increasing data
volumes and a flurry of new data streams that all need to be analysed.
CHARACTERISTICS OF MULTI-STRUCTURED AND STRUCTURED BIG DATA
The characteristics of these new data sources are different from the structured
data that has been analysed in data warehouses for the last twenty years. For
example, the variety of data types being captured now includes:
• Structured data
• Semi-structured data, e.g. XML, HTML
• Unstructured data, e.g. text, audio, video
• Machine-generated data, e.g. sensor data
Semi-structured data such as XML allows navigation of XML paths to go
deeper into the content to derive business value. Unstructured text requires
text mining to parse the data and derive structured data from unstructured
content, while also building full-text indexes. Deriving insight from
unstructured audio and video data is more challenging but, even here, demand
is growing, especially from government agencies and law enforcement.
In addition to data variety, the volumes of data are also increasing.
Unstructured and machine-generated data, in particular, can be very large in
volume. However, volumes of structured transaction data are also increasing
rapidly, mainly because of the growth in the use of online channels from
desktop computers and mobile devices. One side effect of much larger
transaction volumes is that the staging tables on data warehouses that hold
data awaiting ELT processing are growing rapidly, which in turn is forcing
companies to upgrade data warehouse platforms, often at considerable cost, to
hold more data.
Finally, the rate (velocity) at which data is being generated is also increasing.
Clickstream data, sensor data and financial markets data are good examples of
this and are sometimes referred to as data streams.
Businesses now want to analyse new, more complex types of data to add new insights to what they already know.

Social network data, web logs, archived data warehouse data and sensor data are all new data sources attracting analytical attention.

The variety of data is more complex than in traditional data warehousing, with multi-structured data now in demand.

Big data can be much larger in volume.

Machine-generated data is being created at very high rates.
BIG DATA ANALYTICAL WORKLOADS
The arrival of big data and big data analytics has taken us beyond the
traditional analytical workloads seen in data warehouses. Examples of new
analytical workloads include:
• Analysis of data in motion
• Complex analysis of structured data
• Exploratory analysis of un-modeled multi-structured data
• Graph analysis e.g. social networks
• Accelerating ETL processing of structured and multi-structured data to
enrich data in a data warehouse or analytical appliance
• The long term storage and reprocessing of archived data warehouse
data for rapid selective retrieval
These new analytical workloads are more likely to be processed outside of
traditional data warehouses and data marts on platforms more suited to these
kinds of workloads.
HADOOP – A KEY PLATFORM FOR BIG DATA ANALYTICS
One key platform that has emerged to support big data analytical workloads is
Apache Hadoop. The Hadoop software “stack” has a number of components
including:
Component – Description

Hadoop HDFS – A distributed file system that partitions large files across multiple machines for high-throughput access to data.

Hadoop YARN – A framework for job scheduling and cluster resource management.

Hadoop MapReduce – A programming framework for distributed batch processing of large data sets distributed across multiple servers.

Hive – A data warehouse system for Hadoop that facilitates data summarization, ad hoc queries, and the analysis of large datasets stored in Hadoop-compatible file systems. Hive provides a mechanism to project structure onto this data and query it using a SQL-like language called HiveQL. HiveQL programs are converted into MapReduce programs.

HBase – An open-source, distributed, versioned, column-oriented store modeled after Google's BigTable.

Pig – A high-level data-flow language for expressing MapReduce programs for analyzing large HDFS distributed data sets.

Mahout – A scalable machine learning and data mining library.

Oozie – A workflow/coordination system to manage Hadoop jobs.

Spark – A general-purpose engine for large-scale in-memory data processing. It supports analytical applications that wish to make use of stream processing, SQL access to columnar data and analytics on distributed in-memory data.

ZooKeeper – A coordination service for distributed applications.
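The MapReduce programming model listed above can be illustrated with a toy, single-process word count that walks through its three phases: map (emit key/value pairs), shuffle (group values by key, which Hadoop does across the network) and reduce (aggregate each group). This is a conceptual sketch only, not Hadoop code.

```python
from collections import defaultdict

def map_phase(records):
    # Map: emit (key, value) pairs -- here, (word, 1) for a word count.
    for line in records:
        for word in line.lower().split():
            yield word, 1

def shuffle(pairs):
    # Shuffle: group all values by key. In Hadoop this grouping happens
    # across the network between the map and reduce stages.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate each key's list of values into a final result.
    return {key: sum(values) for key, values in groups.items()}

lines = ["big data big insight", "big hadoop"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts)  # -> {'big': 3, 'data': 1, 'insight': 1, 'hadoop': 1}
```

In a real cluster, map and reduce run in parallel on many nodes close to the data; only the structure of the computation is shown here.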
Big data has created new analytical workloads beyond those typical of traditional data warehouses and data marts.

Hadoop has emerged as a platform very much at the centre of big data analytics.

Hive is a data warehouse system for Hadoop that provides a mechanism to project structure on Hadoop data.

Hive provides an interface whereby SQL can be converted into MapReduce programs.

Mahout offers a whole library of analytics that can exploit the full power of a Hadoop cluster.
BUILDING AN ENTERPRISE DATA HUB USING MAPR
WHAT IS AN ENTERPRISE DATA HUB?
Having discussed the characteristics of new data sources, the new
analytical workloads that now need to be supported, and Hadoop as a
key platform for analytics, an obvious question is: "How does Hadoop fit into
an existing analytical environment?" A key emerging role for Hadoop is
that of an Enterprise Data Hub, as shown in Figure 1.
Figure 1
An enterprise data hub is a managed and governed Hadoop environment in
which to land raw data, refine it and publish new insights that can be delivered
to authorised users throughout the enterprise, either on-demand or on a
subscription basis. These users may want to add the new insights to existing
data warehouses and data marts to enrich what they already know and/or
conduct further analyses for competitive advantage.
The Enterprise Data Hub consists of:
• A managed data reservoir
• A governed data refinery
• Published, protected and secure high value insights
• Long-term storage of archived data from data warehouses
[Figure 1: The managed Hadoop enterprise data hub includes a data reservoir, a data refinery and a zone for new insights. Data is loaded into Hadoop and discovered, parsed and prepared, then transformed and cleansed (MapReduce) under an ELT workflow; sandboxes support exploratory analysis, and new high-value insights are published (pub/sub) to an EDW, a graph DBMS and a DW appliance.]
9. The Hadoop Data Refinery and Enterprise Data Hub
Copyright © Intelligent Business Strategies Limited, 2014, All Rights Reserved 9
All of this is made available in a secure, well-governed environment. Within the
enterprise data hub, a data reservoir is where raw data is landed, collected and
organised before it enters the data refinery, where data and data
relationships are discovered and data is parsed, profiled, cleansed, transformed
and integrated. It is then made available to data scientists, who may combine it
with other trusted data such as master data or historical data from a data
warehouse before conducting exploratory analyses in a sandbox environment
to identify and produce new business insights. These insights are the output of
the data refining process. They are made available to other authorised users in
the enterprise by first describing them using common vocabulary data
definitions and then publishing them into a new insights zone, where they
become available for distribution to other platforms and analytical projects. In
addition, cold data that is not being used can be archived from data
warehouses into the hub.
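The publish-and-subscribe flow for new insights described above can be sketched as a toy registry: the refinery registers a named, described insight set, and downstream platforms (a data warehouse load, a mart feed) subscribe to receive it. The class, names and record structure are illustrative only, not any product's API.

```python
from collections import defaultdict

class InsightHub:
    """Toy pub/sub zone for published insights."""

    def __init__(self):
        self.subscribers = defaultdict(list)   # insight name -> delivery callbacks
        self.catalog = {}                      # insight name -> description

    def register(self, name, description):
        # Publish the common-vocabulary definition of the insight first.
        self.catalog[name] = description

    def subscribe(self, name, callback):
        # A downstream platform subscribes to a named insight.
        self.subscribers[name].append(callback)

    def publish(self, name, rows):
        # Deliver refined rows to every subscribed platform.
        for deliver in self.subscribers[name]:
            deliver(rows)

hub = InsightHub()
hub.register("churn_risk", "Customers scored as likely to churn")

warehouse = []                       # stands in for a data warehouse load
hub.subscribe("churn_risk", warehouse.extend)
hub.publish("churn_risk", [{"customer": 42, "risk": 0.87}])

print(warehouse)  # -> [{'customer': 42, 'risk': 0.87}]
```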
THE MAPR DISTRIBUTION FOR HADOOP AS AN ENTERPRISE DATA HUB
PLATFORM
MapR is a vendor that provides a Hadoop platform upon which to build a
managed Enterprise Data Hub. MapR was founded in 2009, is based in San
Jose, California, and offers three editions of its Hadoop Distribution. These are:
MapR Distribution for Apache Hadoop – Description

MapR M3 Standard Edition – A free community edition that includes HBase™, Pig, Hive, Mahout, Cascading, Sqoop, Flume etc. It includes POSIX-compliant NFS file system access.

MapR M5 Enterprise Edition – This enterprise edition includes HBase™, Pig, Hive, Mahout, Cascading, Sqoop, Flume, Impala, Spark etc. M5 is a no-single-point-of-failure edition with high availability and data protection features such as JobTracker HA, no NameNode HA, Snapshots, and Mirroring to synchronise data across clusters.

MapR M7 Enterprise Database Edition – The MapR M7 Database Edition includes all the capabilities of M5 plus enterprise-grade modifications to HBase to make it more dependable and faster.
MapR Disaster Recovery and Data Protection
MapR has also strengthened Hadoop by adding support for disaster recovery
and data protection to their M5 and M7 Hadoop distribution offerings.
In the area of disaster recovery, MapR provides remote mirroring to keep a
synchronized copy of data at a remote site, so that processing can continue
uninterrupted in the case of a disaster. Management of multiple on-site or
geographically dispersed clusters is available with the MapR Control System.
With respect to data protection, MapR has no single points of failure, with no
NameNode HA and distributed cluster metadata. MapR Snapshot provides
point-in-time recovery while MapR Mirroring offers business continuity.
MapR has three editions of its distribution for Hadoop.

MapR provides a distribution for Hadoop that includes over 20 Apache Hadoop projects.

Everything in the MapR distribution is logged and able to restart, with the intent
that the entire cluster is self-healing and self-tuning. The JobTracker and
NameNode have been re-engineered to be distributed and replicated. Direct
Access NFS HA means that clients do not idle waiting for unavailable servers
and rolling upgrades make sure that the cluster is always available. In addition,
workload management is also supported including job isolation, job placement
control, logical volumes, SLA enforcement and enterprise access control to
isolate and secure data access.
Hadoop Workloads and MapR Extensions
Specific examples of where Hadoop is particularly well suited include:
• Offloading and accelerating data warehouse ELT processing at low cost
• Exploratory analysis of un-modeled multi-structured data
• Extreme analytics – for example having to run millions of scoring
models concurrently on millions of accounts to detect “cramming” fraud
on credit cards. This is an activity whereby fraudsters attempt to steal
small amounts of money from large numbers of credit card accounts by
associating false charges with vague financial services and hoping
consumers just don’t notice. Running millions of analytical models
concurrently on data is typically not a workload you would see running
in a data warehouse.
• The long term storage of data and reprocessing of archived data
warehouse data for rapid selective retrieval
These are all workloads that you would expect to find in an Enterprise Data
Hub. The MapR enhancements to the underlying data platform which powers
their Hadoop distribution, provide capabilities needed to support these
including continuous data capture, offloading of ELT processing, exploratory
analytics, long term storage of archived warehouse data, and selective retrieval
of it for analytical processing. Let’s look at these in more detail with particular
focus on the data refining process and offloading ELT processing and some
analytical workloads from data warehouses.
THE DATA REFINERY - ACCELERATING ETL PROCESSING AT LOW COST
The evolution of ETL on big data platforms like Hadoop has mirrored that on
traditional data warehouses. First of all, hand-crafted ETL programs were
written to provision data into Hadoop, transform and integrate it for exploratory
analysis. The problem with this approach is that even if these programs exploit
the multi-processor, multi-server Hadoop platform, development is slow and
expensive requiring scarce MapReduce programming skills.
ETL tool vendors responded by announcing support for Hadoop as both a
target to provision data for exploratory analysis and a source to move derived
insights from Hadoop into data warehouses. However, while this approach
works, ETL processing occurs outside the Hadoop environment and so is
unable to exploit the scalability of the Hadoop platform to deal with the
characteristics of big data.
In order to get scalability, ETL vendors have evolved their products to exploit
Hadoop by implementing ELT processing much as they did on data warehouse
systems. The difference now, however, is that all the data is loaded into
a Hadoop cluster for ELT processing via generated 3GL, Hive or Pig ELT jobs
running natively on a low-cost Hadoop cluster. This is shown in Figures 2 and
3. It is this capability that is so attractive to the many companies looking
for a way to offload ELT processing from data warehouses and create an
enterprise data hub. Offloading ELT processing to Hadoop frees up
considerable capacity on data warehouse platforms, thereby potentially
avoiding expensive data warehouse upgrades. This is especially significant as
transaction data volumes continue to grow and new big data sources become
available for analysis.

Hand-crafted ETL programs were initially created in a Hadoop environment.

ETL servers then emerged to handle big data integration to load into Hadoop.
[Figure 2: Scaling ETL transformations by generating Pig, Hive or 3GL MapReduce code for in-Hadoop ELT processing. A data cleansing and integration tool runs an extract, parse, clean, transform, load and analyse pipeline. Option 1: the ETL tool generates HQL, or converts generated SQL to HQL. Option 2: the ETL tool generates Pig Latin, whose compiler converts every transform to a MapReduce job. Option 3: the ETL tool generates 3GL MapReduce code.]

Several ETL tool vendors have now re-written transforms to run on Hadoop.
Also, several have added new tools and transformations to handle large
volumes of multi-structured data. Therefore, what we are seeing in Hadoop
environments is a full repetition of what happened with ETL tools on MPP
RDBMSs. This time, however, the attraction is that ELT processing can
potentially be done at a much lower cost, given that a Hadoop cluster is a much
cheaper platform on which to store any kind of data. It also opens up the way for ELT
processing to be offloaded from data warehouses and to exploit the full power
of a Hadoop cluster to get the scalability needed to improve performance in a
big data environment.

[Figure 3: Provisioning data into Hadoop for exploratory analysis of multi-structured data using in-Hadoop ELT processing. Web logs, structured data, filtered sensor data and un-modelled multi-structured data are refined by generated MapReduce ELT jobs and fed to sandboxes to produce business insight.]

ETL servers that handle big data cleansing and integration outside of Hadoop are unlikely to scale well – they need to exploit Hadoop.

ETL processing in a big data environment needs to exploit Hadoop to get scalability at low cost.

Several ETL vendors have ported their software to Hadoop to run ELT map-reduce processing on multi-structured big data.
The aforementioned attractions of the Figure 3 pattern are leading many
companies to consider placing a significant slice of their structured and multi-
structured data into a Hadoop Enterprise Data Hub for ELT processing before
making subsets of it available for exploratory analysis on Hadoop itself or on other
data warehouse and NoSQL platforms. The only challenge with this is the
migration of existing ELT jobs running on existing data warehouses, which is
helped by Hive being able to convert ELT-generated SQL (used to transform
data) into MapReduce or potentially even in-memory Spark programs.
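The code-generation approach described in this section, where an ETL tool turns a declared transform into HiveQL that runs inside the cluster, can be sketched as a toy generator. The transform-spec format, table names and helper function here are hypothetical, not any vendor's product; only the emitted INSERT...SELECT shape is standard HiveQL.

```python
def generate_hive_elt(source, target, columns):
    """Emit a HiveQL INSERT...SELECT statement from a simple transform spec.

    `columns` maps target column names to SQL expressions over the staging
    table -- the 'transform' step that then executes inside Hadoop.
    """
    select_list = ",\n       ".join(
        f"{expr} AS {name}" for name, expr in columns.items()
    )
    return (
        f"INSERT OVERWRITE TABLE {target}\n"
        f"SELECT {select_list}\n"
        f"FROM {source};"
    )

# Hypothetical staging-to-warehouse transform for weblog data.
hql = generate_hive_elt(
    source="stg_weblog",
    target="dw_weblog",
    columns={
        "visit_date": "to_date(event_ts)",
        "page": "lower(url)",
        "is_checkout": "url LIKE '%/checkout%'",
    },
)
print(hql)
```

The generated statement is then submitted to Hive, which compiles it into MapReduce jobs running natively on the cluster.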
THE DATA REFINERY - EXPLORATORY ANALYSIS
In addition to ETL processing, another analytical workload very much part of
the data refining process in a Hadoop Enterprise Data Hub is the exploratory
analysis of complex data types. This is where data scientists in the enterprise
data hub often use freeform exploratory tools like search and/or develop and
run batch MapReduce or Spark analytic applications (written in languages like
Java, Python, Scala and R) to conduct exploratory analyses on un-modelled
data stored in the Hadoop system. The purpose of this analysis in the data
refining process is to derive structured insight from unstructured data that may
then be stored in HBase or Hive, or moved into a data warehouse for further
analysis. With Hadoop MapReduce, these analytical programs are copied to
thousands of compute nodes in a Hadoop cluster, where the data is located, in
order to run the batch analysis in parallel. In addition, in-Hadoop analytics in
the Mahout library can run in parallel close to the data to exploit the full power
of a Hadoop cluster. The addition of Spark means that MapR can improve the
performance of exploratory analytical applications by exploiting in-memory
processing. Data access can also be simplified by using SQL via Shark on
Spark instead of lower-level HDFS APIs. Insight derived from this exploratory
analysis can then be published and moved into data warehouses to enrich their
value, or to other analytical data stores for further analysis.
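Deriving structured insight from unstructured data, as described above, can be illustrated with a toy text-mining step that pulls structured fields out of free-text support messages. The field names and patterns are illustrative only; real text mining uses far richer parsing and entity extraction.

```python
import re

# Toy 'text mining': derive a structured record from an unstructured message.
ORDER_RE = re.compile(r"order\s+#?(\d+)", re.IGNORECASE)
PRODUCT_RE = re.compile(r"\b(laptop|phone|tablet)\b", re.IGNORECASE)

def refine(message):
    """Return a structured record derived from one unstructured message."""
    order = ORDER_RE.search(message)
    product = PRODUCT_RE.search(message)
    return {
        "order_id": int(order.group(1)) if order else None,
        "product": product.group(1).lower() if product else None,
        "text_length": len(message),
    }

msg = "My Laptop from order #4521 arrived broken."
record = refine(msg)
print(record)  # -> {'order_id': 4521, 'product': 'laptop', 'text_length': 42}
```

The structured records produced this way are what would be stored in HBase or Hive, or loaded into a data warehouse for further analysis.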
Accelerating Big Data Consumption and Filtering Using
Automated Analytics During in-Hadoop ELT Processing
One of the challenges with big data is dealing with the data deluge: data is
arriving faster than companies can consume it. They therefore have to find a
way to automate the consumption and refining of popular data, bringing it into
the enterprise in a timely way for the business to analyse and act on.
That means not only doing ETL processing on Hadoop, but also being able to
analyse data during ELT processing as part of the data refining process. An
example might be to score Twitter sentiment during ELT processing of Twitter
data on Hadoop so that negative sentiment can be identified quickly and
attributed to customers, products, brand or business functions like customer
service.
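The Twitter sentiment example can be sketched as an in-line scoring step inside a toy ELT pass: each record is transformed and scored in the same job, so negative sentiment surfaces without a separate analysis run. The word lists, record shape and function names are illustrative only.

```python
# Crude sentiment lexicons -- illustrative, not a real sentiment model.
NEGATIVE = {"broken", "terrible", "refund", "awful"}
POSITIVE = {"love", "great", "fast"}

def score_sentiment(text):
    # Lexicon score: each positive word +1, each negative word -1.
    words = {w.strip(".,!?").lower() for w in text.split()}
    return len(words & POSITIVE) - len(words & NEGATIVE)

def elt_with_inline_analytics(tweets):
    """Transform raw tweets AND score them in the same ELT pass."""
    refined = []
    for t in tweets:
        refined.append({
            "user": t["user"].lower(),                 # transform step
            "text": t["text"],
            "sentiment": score_sentiment(t["text"]),   # in-line analytic
        })
    return refined

tweets = [
    {"user": "Ann", "text": "Love the fast delivery!"},
    {"user": "Bob", "text": "Arrived broken, want a refund."},
]
rows = elt_with_inline_analytics(tweets)
negatives = [r for r in rows if r["sentiment"] < 0]
print(negatives[0]["user"])  # -> bob
```

Because the analytic runs during refinement, negative sentiment can be attributed to customers, products or brands as the data lands, rather than in a later batch.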
Figure 4 takes in-Hadoop ELT processing further than Figure 1 by doing in-
line automated analytics on Hadoop during ELT processing. In this way,
popular structured and multi-structured data sources may be consumed and
refined in a more automated way, thereby expediting time to value. In addition,
if data scientists have built custom MapReduce or Spark-based analytics on
this kind of data, then it is potentially possible to exploit these analytics during
ELT processing. This means that once data scientists have built analytics that
analyse data, they can be used and re-used in analyses to produce insight
from new big data sources. Note how in Figure 4 automated analysis allows
the ELT workflow to span the entire data refining process.

Embedding in-Hadoop analytics in Hadoop-based ELT processing allows data to be consumed more rapidly and in a more automated way.
Figure 4
In this way it becomes possible to build an “Enterprise Data Filter” that can
speed consumption of new and existing data sources to expedite the
production of new high value business insights.
KEY MAPR FEATURES THAT MEET ENTERPRISE DATA HUB REQUIREMENTS
Given what is potentially possible, the next question is: how has MapR
enhanced its Hadoop distributions to support the operation of an Enterprise
Data Hub? Since its inception, MapR has sought to enhance a critical part of
the open source Apache Hadoop stack to improve availability, open up access
and improve overall performance and usability.
MapR has strengthened Apache Hadoop considerably to improve its resilience,
improve performance and make it easier to manage. For example, they have
removed multiple single points of failure in Apache Hadoop and introduced
data mirroring across clusters, using asynchronous replication, to support
failover and disaster recovery. In addition, they have added data snapshots, a
heat map management console and have improved performance through data
compression and by rewriting the intermediate shuffle phase that occurs after
Map and before Reduce. HBase has also been strengthened (MapR M7
Edition) to remove compactions and Spark has been added to facilitate high
performance in-memory analytics. All of this makes the MapR distributions
much more enterprise-grade.
[Figure 4: The Managed Hadoop Enterprise Data Hub Includes Automated Analyses to Refine Data Much More Rapidly — an ELT workflow that loads data from the data reservoir and other sources into Hadoop, parses and prepares it, transforms and cleanses it (MapReduce), and discovers data, with automated invocation of custom-built and pre-built analytics on Hadoop publishing new high value insights (pub/sub) to the EDW, a DW appliance and a graph DBMS]

MapR has strengthened Apache Hadoop to improve resilience and performance

MapR features that benefit ETL processing include high availability and Direct Access NFS

Key features of MapR M5 and M7 that benefit ETL processing include:
• No single point of failure, with high availability features such as
JobTracker High Availability and no-NameNode High Availability. In a
global business where ETL processing may need to happen several
times within the day, high availability is very important.
• MapR Direct Access™ NFS which enables real-time read/write data
flows via the industry-standard Network File System (NFS) protocol.
With MapR Direct Access NFS, any remote client can simply mount the
cluster. This means that application servers can write their log files and
other data directly into the cluster, rather than writing them first to direct- or
network-attached storage. This reduces the need for log collection
tools that may require agents on every application server. Application
servers can either write data directly into the cluster or use standard
tools like Rsync to synchronize data between local disks and the
cluster. Either way, it means that ELT processing on log data needed
for clickstream analytics could potentially avoid the extract and load into
MapR M5/M7, thus speeding up the process of making this data
available for analysis. It also reduces data latency which can be
important in many applications.
• MapR Snapshots allow for online point-in-time data recovery without
replication of data. A volume snapshot is consistent (atomic), does not
copy data, and does not impact performance. Snapshots can potentially
help ETL processing in the event of a failure where an ETL job may
need to be restarted from the point of failure or at least from an
intermediate snapshot taken at specific points in the ETL processing.
• The re-writing of the intermediate shuffle phase that occurs after Map
and before Reduce can really help improve ETL performance for ETL
tools generating map-reduce ETL jobs via Hive, Pig or natively with a
3GL language such as Java.
• Automatic data compression can also help improve performance and so
speed up data refinery processes.
• The MapR data protection and disaster recovery capabilities make the
MapR Distributions for Hadoop suitable for long-term storage of big
data and data warehouse archived data which can then be selectively
reprocessed in specific analyses even though it is offloaded from the
data warehouse.
• The MapR remote mirroring capability also allows ELT and analytical
workloads in a data refinery to be spread across clusters in order to get
more work done.
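The Direct Access NFS pattern described above can be sketched as follows; the mount point and directory names are hypothetical, and the point is simply that once the cluster is NFS-mounted, a standard tool such as rsync pushes application-server logs into it without any collection agent:

```python
# Sketch: pushing application-server logs into an NFS-mounted MapR cluster
# with standard rsync. CLUSTER_MOUNT and all paths are hypothetical.
import subprocess

CLUSTER_MOUNT = "/mapr/my.cluster.com"   # assumed NFS mount point

def build_rsync_cmd(local_log_dir, cluster_subdir):
    """Build the rsync command that synchronises local logs into the cluster."""
    dest = "%s/%s/" % (CLUSTER_MOUNT, cluster_subdir)
    return ["rsync", "-av", "--append", local_log_dir.rstrip("/") + "/", dest]

def sync_logs(local_log_dir, cluster_subdir, runner=subprocess.run):
    """Run the sync; 'runner' is injectable so the sketch can be exercised
    without a real cluster."""
    return runner(build_rsync_cmd(local_log_dir, cluster_subdir), check=True)
```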
DATA HUB ELT PROCESSING WITH MAPR HADOOP DISTRIBUTIONS
With respect to ELT processing on Hadoop, MapR partners with a number of
ETL tool vendors and ETL accelerator vendors that can run ELT jobs on MapR
M5 and M7 Hadoop clusters. These include:
• Informatica
• Pentaho
• Talend
Direct Access NFS speeds up the ability to capture data and support change data capture

MapR snapshots help to support ‘point-in-time’ restart of big data ETL processing in the event of a failure without going back to the beginning

Rewrite of shuffle and data compression helps improve ETL processing performance

MapR has several ETL partners that run on Hadoop to accelerate ETL processing

MapR together with its ETL partners can support most ETL patterns
• Syncsort
Using these partner technologies in combination with the MapR M5 and M7
editions, the following patterns are supported:
• Accelerating big data consumption and filtering by using in-Hadoop
analytics during ELT processing
• In-Hadoop ELT processing via MapReduce-based transformations
• Provisioning data into Hadoop sandboxes for exploratory analysis as
part of a data refining process
• Feeding data warehouses from Hadoop to accelerate multi-platform
analytics
In terms of scalability, adding more Hadoop nodes to the cluster allows you to
process this data at speed due to greater I/O parallelism and compute power.
For very large amounts of data, a MapR Hadoop cluster can be spun up in a
cloud environment like the Google Compute Engine to undertake this work at a
much lower cost than trying to configure a cluster of similar size in-house.
With respect to restart of ELT processing, MapR snapshots taken at specific
points in ELT processing make it possible to restart any data refinery ELT
process quickly.
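Designing for restart in this way can be sketched as a checkpointed stage runner. The snapshot hook below is a stub; on a real cluster it might wrap a volume snapshot command, which is an assumption rather than a documented API:

```python
# Sketch of restartable ELT driven by snapshots taken at stage boundaries.
# The snapshot hook is a stub; on a real MapR cluster it might shell out to
# a volume snapshot command (an assumption, not executed here).

def run_elt(stages, completed, snapshot=lambda name: None):
    """Run ELT stages in order, skipping any recorded in 'completed'.

    'stages' is a list of (name, fn) pairs; 'completed' records which stage
    checkpoints exist, so a failed run resumes from the last good snapshot
    instead of restarting the whole refinery process from the beginning.
    """
    for name, fn in stages:
        if name in completed:      # checkpoint exists from a previous run
            continue
        fn()                       # do this stage's work
        snapshot(name)             # take a point-in-time snapshot here
        completed.add(name)
```

A run that fails in the transform stage leaves the load checkpoint behind, so the next invocation re-runs only the transform onwards.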
MapR also provides random read/write access in its Hadoop distribution. In
Apache Hadoop, HDFS is normally append-only, but one of the key features of
the MapR Distribution is Direct Access NFS. In the context of ELT, Direct
Access NFS allows faster and more convenient loading of data into the
Hadoop cluster, thereby reducing data latency. ETL tools that support Change
Data Capture can write changes straight into the MapR Hadoop cluster. An
example of a MapR partner ETL tool vendor that can do this is Talend.
Change data capture is very important to ETL performance, especially on large
volumes of data.
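A change data capture flow over Direct Access NFS reduces, in essence, to appending change events to a file that happens to live on the cluster mount. The event fields and the target path below are illustrative assumptions, not any specific ETL tool's format:

```python
# Sketch: a change data capture writer appending change events directly to a
# file on the NFS-mounted cluster, avoiding a separate extract-and-load hop.
# The event fields and the target path are illustrative assumptions.
import json

def write_changes(changes, out):
    """Append one JSON line per change event to an open cluster-side file."""
    for change in changes:
        out.write(json.dumps(change, sort_keys=True) + "\n")

# Usage (path hypothetical):
#   with open("/mapr/my.cluster.com/cdc/orders.jsonl", "a") as f:
#       write_changes([{"op": "U", "id": 7, "qty": 3}], f)
```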
Finally in terms of ELT performance, the re-writing of the intermediate shuffle
phase that occurs after Map and before Reduce will benefit sorting,
aggregation, hashing and pattern matching transformations all of which are
mainstream transformation functionality needed in most ELT jobs. This
functionality, together with data compression, will boost performance.
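The shuffle in question is the group-by-key step that sits between Map and Reduce. A pure-Python miniature makes clear why sorting and aggregation transformations depend on it:

```python
# A miniature map-shuffle-reduce: the shuffle is the group-by-key phase
# between Map and Reduce that sort, aggregation and hashing transformations
# in ELT jobs lean on.
from collections import defaultdict

def map_phase(records, mapper):
    """Apply the mapper to every record, yielding (key, value) pairs."""
    return [kv for rec in records for kv in mapper(rec)]

def shuffle_phase(pairs):
    """Group mapper output by key — the intermediate phase MapR rewrote."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups, reducer):
    """Reduce each key's values, returning results in sorted key order."""
    return {key: reducer(values) for key, values in sorted(groups.items())}

# Example: total revenue per product — a typical ELT aggregation.
sales = [("widget", 10), ("gadget", 5), ("widget", 7)]
totals = reduce_phase(shuffle_phase(map_phase(sales, lambda r: [r])), sum)
```

Every record passes through the shuffle before any aggregate can be produced, which is why a faster shuffle translates directly into faster ELT jobs.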
Together, these features allow MapR to support all of the key ETL patterns listed above.
HADOOP AS A DATA HUB FOR ALL ANALYTICAL PLATFORMS
Given these enhancements, the MapR Distribution could potentially be used
not only to offload processing from data warehouses but also to create a low
cost data hub (see Figure 5). An Enterprise Data Hub is the foundation pattern
for provisioning data and new insights in a multi-platform analytical environment.
It would be possible to use MapR M5 or M7 as an Enterprise Data Hub that
cleans, transforms and integrates data from multiple structured and multi-
structured sources and provisions trusted data into any analytical platform in a
big data analytical ecosystem for subsequent analysis. This includes:
• MapR M5/M7 Hadoop distribution itself, where sandboxes are created
for data scientists to conduct exploratory analysis as part of a data
refining process
ETL processing can take place on premises or in the cloud

ETL jobs can be designed with restart in mind using MapR snapshots

ETL jobs can handle change data capture using MapR Direct Access NFS

ETL performance is accelerated using MapR shuffle processing and data compression

Archiving data from data warehouses into Hadoop is also needed
• Enterprise data warehouses
• Data marts
• Analytical appliances
• Other NoSQL databases e.g. graph databases
Figure 5
FEEDING DATA WAREHOUSES FROM A HADOOP DATA HUB TO PRODUCE
NEW INSIGHT FROM ENRICHED DATA
Having transformed data on Hadoop and produced insights from it, there is a
need to add any new insights produced to existing environments to add to what
is already known. This means ETL tools must also be able to extract derived
insights produced on Hadoop and integrate them with other structured data
going into a data warehouse (see Figure 6). This
may happen on Hadoop itself (i.e. push data into a data warehouse) or outside
of Hadoop to pull the data into a data warehouse. In this way we can facilitate
multi-platform analytics that may start by analysing data on Hadoop and end up
offering new insights to self-service BI users accessing a data warehouse.
By embedding analytics in Hadoop ELT processing, it is also potentially
possible to turn ELT workflows into multi-platform analytical workflows.
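A multi-platform analytical workflow of this kind can be sketched as transform, embedded analytic, then publish. The complaint keyword, record fields and the dictionary standing in for a data warehouse target are all illustrative assumptions:

```python
# Sketch: an ELT workflow with an embedded analytic step that turns it into
# a multi-platform analytical workflow. Fields, the "refund" keyword and the
# dict standing in for a data warehouse are illustrative assumptions.

def transform(rows):
    """Cleanse: drop records with no text (an ordinary ELT transform)."""
    return [r for r in rows if r.get("text")]

def analyse(rows):
    """Embedded analytic: flag records mentioning a complaint keyword."""
    for r in rows:
        r["complaint"] = "refund" in r["text"].lower()
    return rows

def publish(rows, warehouse):
    """Provision only the derived insight into the warehouse stand-in."""
    warehouse["complaints"] = sum(r["complaint"] for r in rows)
    return warehouse

warehouse = {}
rows = [{"text": "I want a refund"}, {"text": "great service"}, {"text": ""}]
publish(analyse(transform(rows)), warehouse)
```

The workflow starts as plain ELT and ends by offering a derived metric to warehouse users — the Hadoop-to-data-warehouse hand-off the text describes.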
[Figure 5: Data Hub - Consume, Clean, Integrate, Analyse And Provision Data From Hadoop To Any Analytical Platform — data from RDBMSs, files, office docs, social networks, cloud sources, web logs, web services, feeds and sensors flows through generated MapReduce ELT jobs and exploratory-analysis sandboxes, and business insight is provisioned to the EDW, data warehouses and marts, DW appliances running advanced analytics on structured data, and NoSQL databases such as graph DBMSs]
ETL tools also need to extract data from Hadoop and provide it to data warehouses and other NoSQL data stores

Providing new insights from Hadoop into data warehouses is a very common requirement
Figure 6
ARCHIVING DATA WAREHOUSE DATA INTO HADOOP
In order to maximise the value from ETL processing in a big data environment,
it must be possible to move data from Hadoop into other NoSQL and relational
analytical platforms and vice-versa. This includes orchestrating multi-platform
analytical ETL workflows to solve complex analytical problems.
Figure 7
Figure 7 shows this capability. With two-way data movement it becomes
possible to take dimension data into Hadoop and archive data from data
warehouses into Hadoop. It also becomes possible to manage data across all
data stores and analytical platforms in Hadoop. What this also shows is that
data management software has to scale much more than before, not just to
handle big data volumes, but also to handle data movement across platforms
during analytical processing. It will also be used for data archiving across
platforms. Therefore ETL scalability on a robust, highly available self-healing
Hadoop platform like MapR is even more important going forward.
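An archiving step of this kind can be sketched as follows, with sqlite3 standing in for the warehouse and a plain CSV file (which could sit on an NFS-mounted MapR directory) standing in for the Hadoop archive; the table and column names are illustrative:

```python
# Sketch: archiving aged warehouse rows into a file that could live on an
# NFS-mounted Hadoop directory. sqlite3 stands in for the warehouse; the
# 'sales' table and its columns are illustrative assumptions.
import csv
import sqlite3

def archive_old_rows(conn, cutoff_year, out_path):
    """Copy rows older than cutoff_year to a CSV archive, then delete them
    from the warehouse, returning how many rows were archived."""
    rows = conn.execute(
        "SELECT id, year, amount FROM sales WHERE year < ?", (cutoff_year,)
    ).fetchall()
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "year", "amount"])
        writer.writerows(rows)
    conn.execute("DELETE FROM sales WHERE year < ?", (cutoff_year,))
    conn.commit()
    return len(rows)
```

The archived rows remain selectively reprocessable from Hadoop even though they have been offloaded from the warehouse.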
[Figure 6: Leveraging Hadoop for Data Integration on Massive Volumes of Data to Bring Additional Insights Into a DW — hundreds of terabytes up to petabytes of cloud data and social web content (e.g. from Twitter, Facebook, Digg, Myspace, TripAdvisor, LinkedIn) are extracted into HDFS, transformed and analysed by MapReduce applications (e.g. Pig, Jaql) for sentiment analytics, and the relevant insight is loaded via ETL alongside data from operational systems into the EDW, MDM system, data warehouses and marts, DW appliances running advanced analytics on structured data, and NoSQL databases such as graph DBMSs]
[Figure 7: Need to Manage the Supply of Consistent Data And Archive Data Across The Entire Analytical Ecosystem — an enterprise information management tool suite coordinates stream processing, CRUD maintenance of master data (product, asset, customer) and the flow of new data from RDBMSs, files, office docs, social networks, cloud sources, web logs, web services, feeds and sensors across all analytical platforms]
Archiving data from data warehouses into Hadoop is also needed

There is a need for two-way movement of data between Hadoop and other data stores and to manage data across all analytical platforms in a big data ecosystem
CONCLUSION
The emergence of new data sources and the need to analyse everything from
unstructured data to live event streams has led many organisations to realise
that the spectrum of analytical workloads is now so broad that they cannot all
be dealt with in a single enterprise data warehouse. Companies now need
multiple analytical platforms in addition to traditional data warehouses and data
marts to manage big data workloads. New big data platforms like Hadoop,
stream processing engines, and NoSQL graph DBMSs are all emerging as
platforms optimised for specific analytical workloads that need to be added to
the enterprise analytical setup.
This has resulted in a more complex analytical environment that has put much
more emphasis on data management to keep data consistent across big data
workload-optimised analytical platforms and traditional data warehouses.
ETL software now has to deal with multiple data types, very large data volumes
and high velocity event streams as well as handling traditional ETL processing
into data warehouses. In addition, this software must now deal with the need to
rapidly move data between big data and traditional data warehousing platforms
during the execution of analytical workloads. All this is needed while continuing
to deliver value for money and without causing dramatic increases in cost. Data
refinery processes have to be fast, efficient, simple to use and cost-effective.
Several data management vendors now support Hadoop as both a source and
a target in their ETL tools. They also generate HiveQL, Pig or Java to
create MapReduce ELT processing jobs that fully exploit massively parallel
Hadoop clusters. To support faster filtering and consumption of data, we are
also seeing ETL tools starting to support the embedding of analytics into ETL
workflows so that fully automated analytical workflows can be built to speed up
the rate at which organisations can consume, analyse and act on data.
It is this combination of Hadoop with data management software and in-
Hadoop analytics that opens up the attractive proposition of creating a low cost
Enterprise Data Hub (as shown in Figure 4) that manages and accelerates the
data refinery process in an end-to-end big data analytical ecosystem. The
MapR Distribution for Hadoop is well suited to this role and can also support
the offloading of a subset of analytical processing from data warehouses. The
Enterprise Data Hub is not just for data warehousing however. Its job is to
become the foundation for cleansing, transforming and integrating structured
and multi-structured data from multiple sources before provisioning filtered data
and new insights to any platform in the entire big data analytical ecosystem for
subsequent analysis.
MapR, with its enterprise-grade Hadoop distribution and its partners, looks
ready for the challenge.
Business is now demanding more analytical power to analyse new sources of structured and multi-structured data

ETL tools have to scale to support more data, more complex transformations and faster data loading

Support for Hadoop and rapid data movement between Hadoop and data warehouses is needed

Hadoop in combination with ETL processing offers an attractive low cost way to implement a data management hub for the entire big data analytical ecosystem

MapR can help customers take ETL processing to the next level
About Intelligent Business Strategies
Intelligent Business Strategies is a research and consulting company whose
goal is to help companies understand and exploit new developments in
business intelligence, analytical processing, data management and enterprise
business integration. Together, these technologies help an organisation
become an intelligent business.
Author
Mike Ferguson is Managing Director of Intelligent Business Strategies Limited.
As an analyst and consultant he specialises in business intelligence and
enterprise business integration. With over 31 years of IT experience, Mike has
consulted for dozens of companies on business intelligence strategy, big data,
data governance, master data management, enterprise architecture, and SOA.
He has spoken at events all over the world and written numerous articles and
blogs providing insights on the industry.
Formerly he was a principal and co-founder of Codd and Date Europe Limited
– the inventors of the Relational Model, a Chief Architect at Teradata on the
Teradata DBMS and European Managing Director of Database Associates, an
independent analyst organisation. He teaches popular master classes in Big
Data Analytics, New Technologies for Business Intelligence and Data
Warehousing, Enterprise Data Governance, Master Data Management, and
Enterprise Business Integration.
INTELLIGENT
BUSINESS
STRATEGIES
Water Lane, Wilmslow
Cheshire, SK9 5BG
England
Telephone: (+44)1625 520700
Internet URL: www.intelligentbusiness.biz
E-mail: info@intelligentbusiness.biz
The Hadoop Data Refinery and Enterprise Data Hub
Copyright © 2014 by Intelligent Business Strategies
All rights reserved