Implementing
the Enterprise
Data Lake
A four-stage approach to building a massive,
easily accessible, flexible and scalable Big Data repository
What is a Data Lake, and how does it help meet the challenges of Big Data? Will
the Enterprise Data Warehouse (EDW) and the Data Lake coexist? If so, how?
This paper explores what it takes to get started on the journey toward
incorporating a Data Lake into an organization’s architecture.
www.impetus.com
Introduction
Data is like money. The world runs on it. We believe it’s valuable, and we are pretty
sure we can’t have too much of it. We save it, store it, move it around in various
formats, and use it for all kinds of purposes. We will also take it in just about any
form we can get it. We might not know how we’re going to use it, but we’re willing
to collect it now and figure out what to do with it later. For the most part, we’re
happy as long as it just keeps on coming in.
At least, that’s been true when we thought of data as finite. But, with the advent
of Big Data, it’s pouring in like never before - and while we still want it in any form
we can get it, structured or unstructured, the issues of storing, managing, and
analyzing it are becoming more complex.
Interestingly, despite all the advances of technology, the money that data most
closely resembles isn’t the conceptual kind we refer to when we say it’s “on
paper” or that is traded in nanoseconds in financial markets. Rather, data is
more like cold hard cash. It’s essentially a physical thing - heavy, a bit
cumbersome and hard to move. Large data transfers can take days. You don’t
want to move it very frequently, and you don’t want to move it very far. And if you
do have to move it, you’d prefer to transport it as safely as possible - via
armored truck, for example.
So what to do with it all? How do we store it? How do we manage it? How do we use
it? These are the questions that lead us to the Data Lake. But what is a Data Lake,
and how does it help meet the challenges of Big Data?
Defining the Problem
Before we talk about what a Data Lake is, let’s define the problem and a term
or two a little more clearly. First there is unstructured data. While organizations are
amassing massive amounts of data, much of it is unstructured. Unstructured data
refers to information that either does not have a pre-defined data model or is not
organized in a pre-defined manner. And that’s a concern because if it’s not pre-
defined or structured, it’s usually difficult to analyze. And if you can’t analyze it,
what’s the point? Additionally, structuring it is a laborious, time-consuming task.
However, unstructured data accounts for much of the explosion that is Big Data. It
is also widely understood as holding the most promise for gaining new, actionable
insights.
Nearly all the data that lives outside of databases is unstructured, including
images, videos and log files produced by computers, machines and sensors. Even
this document is unstructured data. The sheer volume of it is staggering;
unstructured data makes up at least 80 percent of all digitally stored data. And as
the data-driven economy grows, the amount of unstructured data will only continue to grow.
So, what to do? Enter the Data Lake.
What is a Data Lake?
A “Data Lake” is one of several interchangeable terms in common use; others
are Big Data repository, unified data architecture, and modern data
architecture. No matter what it is called, the concept is the same:
take the data that’s coming in - in unstructured torrents - and store it where it’s
accessible, flexible and scalable, and where it can be analyzed without the need to
structure it first. Here are two common definitions:
• A Data Lake is a massive, easily accessible, flexible and scalable
data repository
• A Data Lake is an enterprise-wide data management platform for analyzing
disparate sources of data in their native format
Data Lakes include structured, semi-structured, and unstructured data. They are
built on inexpensive commodity hardware and are designed for storing
uncategorized pools of data, including the following:
• Data immediately of interest
• Data potentially of interest
• Data for which the intended usage is not yet known
The information in the Data Lake is consolidated - both structured and
unstructured - in a manner that allows inquiries to be made across the entire body
of data all at once. This ability to access all of the data is especially appealing
because the true power of Big Data is the ability to correlate insights across
previously siloed data warehouses or between structured and unstructured data
sources.
Drivers for the Data Lake
Real-time enterprise data analytics are all about improving decision making. With
so much data traditionally siloed in different data warehouses, such as
Enterprise Resource Planning (ERP), Customer Relationship Management
(CRM), Human Capital Management (HCM), and others, it’s almost impossible to
make correlations across these somewhat captive data sources. The thinking now
is to integrate data silos, build infrastructures that empower data science to
improve analytics, and reduce time to market through faster analytical processing.
These are some of the drivers behind the new architecture that is a Data Lake.
Limitations of the Current Enterprise Data Warehouse?
Why can’t we just use what we’ve always used? For the last several decades,
EDWs have served as the foundation for business intelligence and data
discovery. The world of data warehousing was a more predictable world, a world
where structures and formatting could take place in advance, where hypotheses
were drawn, where the content of data was known, and where the scope was
restricted and pre-defined. Thus, the metaphor of a warehouse
worked well because, like a warehouse, one could organize data the way one
might stock shelves.
In the real world of actual shelf-stocking, there are obviously some significant
constraints related to physical space and cost. For example, if you were running a
logistics company or a retail company, you’d need a physical structure as well as
shelves and floor space to store all your pallets and boxes. You’d need a plan for
what and where to store your inventory as well as labels for everything so that you
could efficiently organize the space for shipping and managing your goods.
In the world of data storage, the constraints are the same as with physical
inventory: it costs to store and it costs to move. However, part of the complexity
in the realm of Big Data is that we no longer know what’s in the metaphorical
boxes, let alone how we’re going to use their contents. EDWs are not only costly,
they are also not structured to handle the complexities of Big Data. Warehouses work when
you can define what’s in the boxes and all the associated logistics. With Big
Data, that’s no longer possible.
Thus, what makes the EDW great is also what restricts it. Data warehouses store
data in specific static structures and categories that dictate the kind of analysis
that is possible on that data. With the emergence of Big Data, this approach falls
short because it’s impossible to determine what the data might hold. And in cases
where analysis is required in real-time, formatting in advance is not an option. The
point here is that the world of data has become fluid, not static. Data now arrives
in massive volumes, at near real-time velocity, and in many unstructured forms.
Real data discovery requires that analysts be able to ask questions of the data as
train-of-thought demands. The real questions only emerge during the process of
analysis itself - something that is not easily done in the EDW world.
What’s needed is an approach that allows business users to siphon off or distill the
information they need as they need it. This is the shift that underpins the business
Data Lake and which changes the game to something that better meets the needs
of today’s responsive, real-time enterprise.
Capabilities of the Data Lake
What capabilities does the Data Lake bring to the enterprise? What are the
capabilities that didn’t exist prior to the Data Lake?
Here’s our list of the top four:
Active Archive: Providing Access to Historic Data
An active archive provides a single place to store all your data, in any format, at any
volume, indefinitely.
Enterprise data governance policies - and, in many cases, federal law - govern the
management of data, including how long data must be retained. An active archive
allows you to address these kinds of compliance requirements and deliver data on
demand to satisfy internal and external regulatory demands. Because it is secure,
you control who sees what; because it delivers governance and lineage services,
you can trace the access and evolution of your data over time.
Having access to historic data - both raw source information and data archived from
conventional relational stores - is extremely valuable in use cases where there are
requirements to deliver data on demand, such as health records that must be
kept for a certain amount of time or financial records that must be retained for
regulatory compliance. This capability is very useful for attaining immediate insights
instead of waiting for long, drawn-out retrieval processes.
Self-Service Exploratory Business Intelligence
In many ways, stored data can be the best currency an organization has to offer.
But like all other investments, this comes at a price, as organizations must
dedicate money and time to protecting their data. Users frequently want access to
enterprise data for reporting, exploration, and analysis. But production enterprise
data warehouse systems often need to be protected from casual use so they can
run the mission-critical financial and operational workloads they support.
An enterprise Data Lake allows users to explore data, with full security, using
traditional interactive business intelligence tools via SQL and keyword search.
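
As a minimal illustration of what such self-service access can look like, the sketch below issues an exploratory SQL query through PyHive, an open-source Python client for HiveServer2. The host, credentials, table, and column names are illustrative assumptions, not details from this paper.

```python
from pyhive import hive  # open-source HiveServer2 client: pip install pyhive

# Connect to a HiveServer2 gateway (host, port, and user are hypothetical)
conn = hive.Connection(host="hive-gateway.example.com", port=10000,
                       username="analyst")
cursor = conn.cursor()

# An exploratory, train-of-thought query over data in the lake
cursor.execute("""
    SELECT region, COUNT(*) AS negative_tickets
    FROM support_tickets
    WHERE sentiment = 'negative'
    GROUP BY region
    ORDER BY negative_tickets DESC
""")

for region, negative_tickets in cursor.fetchall():
    print(region, negative_tickets)
```

The same security controls mentioned above still apply: the analyst sees only the tables and rows their role permits, while the production warehouse remains untouched.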
Advanced Analytics: Far Beyond Data Sampling
A secret of many data analysis projects is that calculations are based on
representative samples of the data rather than full sets. While this works nicely if
you’re trying to determine whether an Oscar-nominated film is likely to win an
Academy Award based on its popularity compared to other nominated films, what if
you are a researcher at the Centers for Disease Control trying to determine the
cause of an outbreak, or an investment banker trying to measure risk, or a retailer
wanting to understand customer motivations across channels? The bottom line is
that you are much better off with the ability to search and analyze data on a large
scale and a granular level, rather than just sampling the data. Data Lakes provide
that level of fidelity.
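
The difference is easy to see in code. The hedged PySpark sketch below - paths and column names are invented for illustration - contrasts a sample-based estimate with a full-set, granular aggregation; rare segments that vanish from a 1 percent sample remain fully represented in the complete scan.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("sampling-vs-full-scan").getOrCreate()

# Hypothetical curated transaction data already landed in the lake
txns = spark.read.parquet("hdfs:///lake/curated/transactions/")

# Sample-based estimate: fast, but rare channels and segments may disappear
estimate = (txns.sample(fraction=0.01, seed=42)
                .groupBy("channel")
                .agg(F.avg("amount").alias("avg_amount")))

# Full-set aggregation at a granular level: every segment is represented
exact = (txns.groupBy("channel", "customer_segment")
             .agg(F.avg("amount").alias("avg_amount"),
                  F.count("*").alias("n_txns")))
exact.show()
```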
Low Cost of Transformation: Optimizing Workloads
Extract, Transform and Load (ETL) workloads that had previously run on
expensive systems can now migrate to the enterprise Data Lake, where they run
at very low cost, in parallel, and much faster than before. Optimizing the
placement of these workloads frees capacity on high-end analytic and data
warehouse systems, making them more valuable by allowing them to
concentrate on the business-critical applications that they process.
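
As a sketch of what such an offloaded ETL job can look like, the PySpark fragment below reads raw extracts landed in HDFS, applies transformations in parallel across the cluster, and writes columnar output. It assumes a Spark runtime on the Data Lake; the paths and fields are illustrative, not taken from this paper.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("etl-offload").getOrCreate()

# Raw delimited extracts landed in the lake (illustrative path)
raw = spark.read.csv("hdfs:///lake/landing/orders/",
                     header=True, inferSchema=True)

# Transformations that previously ran on the expensive warehouse,
# now executed in parallel on commodity hardware
cleaned = (raw.dropDuplicates(["order_id"])
              .withColumn("order_ts", F.to_timestamp("order_ts"))
              .filter(F.col("amount") > 0))

# Columnar output, ready for downstream query engines
cleaned.write.mode("overwrite").parquet("hdfs:///lake/curated/orders/")
```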
Adjunct to the EDW
With this newfound wealth of data, we’re also experiencing a cultural shift toward
democratization of data. Leading organizations are now saying, “Here, have some.
Let’s let everybody have access and see what they can do with it.” This is due to the
growing recognition that the more an organization can harness information, the
greater the value it derives from deeper insights. For this reason, organizations
are removing blocks to innovation and transforming the way data contributes to
success.
Serving as an adjunct to the EDW, a Data Lake can:
• Work in tandem with the EDW and allow you to offload colder data to the
Data Lake.
• Allow you to work with unstructured data.
• Support a cultural shift towards democratized data access.
• Contain costs while continuing to do more with more data.
Sounds compelling, doesn’t it? But how do you know if you are ready for a Data
Lake?
Determining Readiness: Some Questions to Ask
Here are some of the critical drivers that indicate readiness:
• Are you working with a growing amount of unstructured data?
• Are your lines of business demanding even more unstructured data?
• Does your organization need a unified view of information?
• Do you need to be able to perform real-time analysis on source data?
• Is your organization moving toward a culture of democratized data access?
• Would your organization benefit from elasticity of scale?
Design Principles
If you are ready for a Data Lake, there are some key design principles that we
recommend following. Here are our top five:
• Discovery without limitations
• Low latency at any scale
• Movement from a reactive model to a predictive model
• Elasticity in infrastructure
• Affordability
One of the most significant reasons to build a Data Lake is to encourage
experimentation and to move from an intuition-based model to a more
comprehensive, empirical, data science driven model. In order to enable that kind
of experimentation and analytical finesse to thrive, you have to allow for discovery
without limitations. By that, we mean you have to be willing and able to give users
access to all that data. You should also be able to perform low latency queries of
data at any scale.
Now let’s talk about what it takes to build a Data Lake.
Not a Big Bang Approach:
The Four Stages to Building a Data Lake
From our experience, building a Data Lake doesn’t happen all at once; instead,
there are stages of maturity:
1. Handling and ingesting data at scale
2. Building analytical muscle
3. Leveraging the strengths of the EDW and the Data Lake
4. Adopting broadly
1. Handling and Ingesting Data at Scale
This first stage involves getting the architecture in place and learning to acquire and
transform data at scale. This is when your organization will need to determine the
new and existing data sources that it can leverage. These data sources are then
integrated, and the volume and variety of data is ingested at high velocity into Hadoop
storage. At this stage, the analytics may be rather simple, perhaps consisting of
basic transformations, but it’s an important step in discovering how to make
Hadoop work the way you want.
2. Building Analytical Muscle
The second stage focuses on improving the ability to transform and analyze data.
This is where you begin to really leverage the enterprise Data Lake. For example,
your organization can start building batch, mini-batch, and real-time applications
for enterprise usage, exploratory analytics, and predictive use cases. Various tools
and frameworks are used at this stage. The EDW and the Data Lake start working
together.
3. Leveraging the Strengths of the EDW and the Data Lake
This is when the orchestra really starts to play. Here, in the third stage, you will
want to get data and analytics into the hands of as many people in the
organization as possible. Democratization begins. This is also the stage where the
EDW and Hadoop-based Big Data lake truly co-exist, allowing the enterprise to
leverage the strengths of each architecture.
4. Adopting Broadly
The fourth level is the highest stage of Data Lake maturity. Enterprise
capabilities are added to the Data Lake. Broad adoption of unified Data Lake
architectures requires information governance, compliance, security, auditing,
metadata management and information lifecycle management capabilities. Not
addressing these issues may result in slow enterprise adoption and runs the risk
that the Data Lake eventually becomes a “data swamp.”
The Big Data Lake: Understanding the Essentials
Understanding the layers of the data warehouse is an essential step in the Big
Data journey. The following pages elucidate the components of a Big Data
warehouse and the methodology to set it up.
Components of Big Data Warehouse
While requirements and specific business needs may vary within each
organization, the following diagram lists the major components of a Big
Data warehouse.
Data Sources
An enterprise usually has the following sources of data:
• A relational database such as Oracle, DB2, PostgreSQL, SQL Server and
the like
• Multiple disparate, unstructured and semi-structured data sources which
may have data in formats such as flat files, XML, JSON or CSV
• Existing systems may further provide integration data in EDI or other B2B
exchange formats
• Machine data and network elements that generate huge volumes of data
Hadoop Distribution
Hadoop is the most popular choice for Big Data today and is available in open-
source Apache and commercial distribution packages. Hadoop includes a file
system called HDFS (Hadoop Distributed File System), which forms the key data
storage layer of the Big Data warehouse. Other options are also available, such as
GPFS (from IBM) and S3 (from the Amazon cloud).
Data Ingestion
It is imperative to set up reliable and scalable data ingestion mechanisms to
bring data in from data sources to the Hadoop file system.
• For connecting relational databases, the most popular options are Sqoop and
database-specific connectors
• For streaming data, Apache Kafka and Flume are quite popular
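
For the streaming path, here is a minimal producer-side sketch using the open-source kafka-python client; the broker address, topic name, and event fields are assumptions for illustration only.

```python
import json
from kafka import KafkaProducer  # open-source client: pip install kafka-python

# Broker address and topic are hypothetical
producer = KafkaProducer(
    bootstrap_servers="broker1.example.com:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"))

# Publish a machine-generated event into the ingestion pipeline
producer.send("machine-events", {"sensor_id": "s-42", "temp_c": 71.5})
producer.flush()  # block until the event is actually delivered
```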
Figure 1: Big Data warehouse reference architecture by Impetus (diagram). Data sources (relational data - PostgreSQL, Oracle, DB2, SQL Server; flat files/XML/JSON/CSV; existing systems; machine data/network elements) feed a data ingestion layer (Kafka/Flume, Sqoop/connectors, existing DI tools, REST/JDBC/SOAP/custom, streaming) into Big Data storage and NoSQL data stores. A data query layer (relational offload engine: Hive/Pig/Drill/Spark SQL) and access services (search, pipelines, cubes), together with virtualization (federation, delivery, polyglot mapper), management (provisioning, monitoring, performance optimization, security), and governance (data quality, lifecycle management, data classification, information policy), support business intelligence (machine data analysis, predictive and statistical analysis, data discovery, visualization and reporting).
• Organizations that need to leverage streaming data sources - setting up an entire
topology of streaming source, ingestion, in-flight transformation and data
persistence - would need to use one of the common CEP (Complex Event
Processing) or streaming engines, such as Apache Storm or StreamAnalytix (a
minimal consumer-side sketch follows this list).
• Organizations that need to leverage their existing Data Integration (DI)
connectors may need custom scripts to integrate using REST, SOAP or JDBC
components.
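
A full CEP engine is beyond a short example, but the sketch below shows the consumer side of such a topology in miniature - read from a stream, transform in flight, persist - using the open-source kafka-python client. A local file stands in for the HDFS sink, and all names are illustrative assumptions.

```python
import json
from kafka import KafkaConsumer  # pip install kafka-python

# Subscribe to the hypothetical topic produced in the earlier sketch
consumer = KafkaConsumer(
    "machine-events",
    bootstrap_servers="broker1.example.com:9092",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")))

for message in consumer:
    event = message.value
    # In-flight transformation: flag out-of-range readings
    event["alert"] = event.get("temp_c", 0) > 85.0
    # Persistence: a local file stands in for the HDFS landing zone
    with open("/data/landing/machine-events.jsonl", "a") as sink:
        sink.write(json.dumps(event) + "\n")
```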
Data Query
For data resident in HDFS, a multitude of query engines is available, such as Pig,
Hive, Apache Drill and Spark SQL.
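
For instance, Spark SQL can register files resident in HDFS as a table and query them with ordinary SQL. In the hedged sketch below, the path, view name, and columns are invented for illustration.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lake-query").getOrCreate()

# Register raw JSON events in HDFS as a queryable view (illustrative path)
events = spark.read.json("hdfs:///lake/raw/events/")
events.createOrReplaceTempView("events")

# Ordinary SQL over data that was never structured in advance
top_devices = spark.sql("""
    SELECT device_type, COUNT(*) AS hits
    FROM events
    GROUP BY device_type
    ORDER BY hits DESC
    LIMIT 10
""")
top_devices.show()
```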
Many organizations, however, would prefer to re-use their SQL scripts and
procedures written for their traditional enterprise data warehouse. Because they
have already invested millions of dollars in the traditional SQL and PL/SQL engines,
it is understandable that organizations want to explore mechanisms that will allow
them to offload the data tables from relational data warehouses to a Big Data
warehouse while keeping their querying/reporting scripts intact.
Tools and solutions are now available from organizations like Impetus
Technologies that help enterprises offload expensive computation from
relational data warehouses to Big Data warehouses without re-writing the entire
processing layer.
Data Stores
Along with HDFS, there is a trend to couple a data store or NoSQL database,
like HBase or Cassandra, with the Big Data warehouse. These stores provide
additional functions in the form of columnar storage, schema-less storage,
querying, OLAP/OLTP workloads, and application integration.
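
As a small illustration of the kind of schema-less, application-facing access such a store adds, the sketch below uses happybase, an open-source Python client for HBase’s Thrift gateway; the host, table, and column names are assumptions, not part of the reference architecture.

```python
import happybase  # open-source HBase Thrift client: pip install happybase

# Connect to a hypothetical HBase Thrift gateway
connection = happybase.Connection("hbase-thrift.example.com")
table = connection.table("clickstream")

# Schema-less write: column families are fixed, column names are free-form
table.put(b"user123#20150601", {b"event:page": b"/checkout",
                                b"event:latency_ms": b"842"})

# Low-latency point read for application integration
row = table.row(b"user123#20150601")
print(row)
```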
Access
With data stored in the HDFS or NoSQL layer, organizations have increasingly
complex access requirements. These include features from the traditional world,
like search and cube functions. There are also new tools that help manage
complex pipelines of jobs, where the output of one query is fed as input into
another.
Governance
Ensuring data quality is the key reason for data governance in the Big Data
warehouse.
• While the aim of the Big Data warehouse is to offer a Data Lake integrated
with all enterprise data sources, it is still essential to apply data quality
regulations to ensure the Data Lake does not turn into a data swamp
• Similarly, data users increasingly need to make sure that they are able to
manage the data through its entire lifecycle
• Classifying data based on various segments, like business user groups (for
instance, marketing, risk management, and operations), ensures control and
governance of the data
• It is also imperative to define enterprise-level information policies to avoid
breaches and ensure control over the entire data warehouse
Virtualization
Organizations have found that, despite their best intentions and use cases, they
may have to deal with the coexistence of the enterprise data warehouse and the
Big Data warehouse for a period of time. To ensure consistent results with
appropriate polyglot querying, data federation and delivery mechanisms are
essential.
Management
To provision and monitor the entire cluster, operations teams need handy
tools and dashboards for cluster management. It is not uncommon to find
engineers diagnosing the performance of MapReduce jobs and queries in their
quest for optimal speed and minimal resource consumption. Security is another
key aspect of the warehouse, with authentication and role-based authorization
behind defined gateways.
Business Intelligence
The goal of the warehouse is to achieve business insights and generate
intelligence for the organization. To achieve that objective, business teams
need to be empowered with various visualization and reporting tools. Data
scientists can also help discover data patterns using predictive/statistical
algorithms and machine data analytics.
Stages for Setting up a Big Data Warehouse
The journey to a Big Data warehouse is a multi-stage process. It requires selecting
the right tools, keeping a clear vision, and following a process to lay out an
effective and integrated data warehouse. The key stages for setting up a Big Data
warehouse broadly include the following:
Stage One: Handle and Ingest Data at Scale
As the first stage, the organization needs to determine the existing and new
data sources that it can leverage.
• The data sources are integrated and the variety of voluminous data is
ingested at high velocity into Hadoop storage
• The incoming data may come in varied formats, ranging from unstructured,
structured, and streaming data to machine, geo-spatial, and time-series data,
or external data sets like social media data
Figure 2: Handle and ingest data at scale (diagram). Streaming, unstructured, structured, machine, geospatial/time-series, and external social data flow through landing and ingestion into Big Data storage.
Stage Two: Build the Analytical Muscle
In order to leverage the enterprise Data Lake in Hadoop, the organization builds
batch, mini-batch and real-time applications for enterprise usage, exploratory
analytics and predictive use cases. Various tools and frameworks are utilized in
this stage as organizations begin to:
• Explore advanced querying engines, starting with MapReduce and moving on to
Apache Spark, Flink, etc., for interactive results
• Build use cases for both batch and real-time processing using streaming
solutions like Apache Storm and StreamAnalytix
• Build analytic applications for enterprise adoption, exploration, discovery and
prediction
Stage Three: Enterprise Data Warehouse and Big Data Warehouse Work in
Unison
In a real-world scenario, the enterprise data warehouse (EDW) and the Hadoop-based
Big Data warehouse (BDW) would co-exist as follows:
• The organization would leverage the data and specific capabilities of each to its
advantage
• Rather than disposing of the expensive enterprise warehouse, organizations
prefer to leverage it alongside Big Data technologies
• Once a stable and mature Big Data warehouse is achieved, the EDW and
BDW work in unison to achieve multi-workload distribution, offloading to each
other as required
• Specialized solutions, like the Impetus relational offload solution, help
organizations save millions of dollars with superior time and
schedule benefits
Figure 3: Build the analytical muscle (diagram). The landing-and-ingestion pipeline of Figure 2 now feeds Big Data storage plus real-time, enterprise, exploration and discovery, and predictive applications, with provisioning, workflow, monitoring and security spanning the stack.
Stage Four: Achieving Enterprise Maturity in the Warehouse
For a unified data warehouse, various enterprise ready capabilities are needed.
These are particularly pertinent in the case of information governance, metadata
management and information lifecycle management.
While organizations may begin with basic governance paradigms, as they mature in
the journey it becomes essential to have more sophisticated practices and policies.
Further, the appetite of the user is no longer satisfied by simply exploring data and
managing it through its lifespan. Instead, organizations need tools and utilities to
handle the laborious tasks of discovering data, deriving insights and managing the
information lifecycle.
Figure 4: Enterprise data warehouse and Big Data warehouse work in unison (diagram). The stack of Figure 3 now exchanges workloads with traditional data repositories (RDBMS, MPP).
Figure 5: Achieving enterprise maturity in the warehouse (diagram). The unified stack of Figure 4 gains a layer for governance, information lifecycle, and enterprise metadata management.
Summary
• Success in the business world today depends on high-quality, accessible
information that offers actionable insight
• The emergence of the Data Lake is inspired by the need to manage and
exploit new types of data
• Organizations need new architectures capable of integrating both structured and
unstructured data, managing massive data sets, and delivering
real-time analytics
• Hadoop-based Big Data architectures have changed the face of the data
warehouse, business intelligence and analytics world forever
• There is a growing acceptance of the concept of a Data Lake as a
cornerstone component of an enterprise Big Data strategy
• Enterprise adoption of Big Data architectures is accelerating as a way to
enable broad new opportunities across all functions and industries
• Big Data warehouse architectures will complement rather than replace the
enterprise data warehouses of today
• As the enterprise data warehouse and the Data Lake work together in unison,
they provide a synergy of capabilities, ultimately allowing analysts to do more
with data and derive business results faster
Conclusion
Each new technological leap brings with it a buzz of excitement and uncertainty.
Big Data technologies, and the Hadoop ecosystem in particular, seem to have
captured imaginations across the IT landscape. However, the journey to a Big
Data warehouse requires adequate planning and vision, combined with robust
engineering and technology practices.
Smart, conscientious Data Lake development can drive greater value to and
from a company’s data, while tapping the incredible power of innovation to drive
real insight.
Impetus Technologies specializes in this niche area, has unraveled the
mysteries surrounding the Big Data warehouse for many customers, and continues
to stay on top of the continuously evolving ecosystem. With products, solutions and
services expertise available at every required stage of the Big Data journey,
enterprise adopters can collaborate with niche providers to weigh the options
and chart a visionary path in their industry segment. Big Data provides an
unparalleled opportunity to turn information into a competitive asset, yielding
revenues for business and glory for IT like never before.
© 2015 Impetus Technologies, Inc. All rights reserved. Product and company names mentioned herein may be trademarks of their respective companies.
June 2015
About Impetus
Impetus is focused on creating big business impact through Big Data Solutions for Fortune
1000 enterprises across multiple verticals. The company brings together a unique mix of
software products, consulting services, Data Science capabilities and technology expertise.
It offers full life-cycle services for Big Data implementations and real-time streaming
analytics, including technology strategy, solution architecture, proof of concept, production
implementation and on-going support to its clients.
Visit http://impetus.com or write to us at bigdata@impetus.com
About Impetus

More Related Content

What's hot

Data Lake,beyond the Data Warehouse
Data Lake,beyond the Data WarehouseData Lake,beyond the Data Warehouse
Data Lake,beyond the Data Warehouse
Data Science Thailand
 
Making Big Data Easy for Everyone
Making Big Data Easy for EveryoneMaking Big Data Easy for Everyone
Making Big Data Easy for Everyone
Caserta
 
Security issues in big data
Security issues in big data Security issues in big data
Security issues in big data
Shallote Dsouza
 
Intro to Data Science on Hadoop
Intro to Data Science on HadoopIntro to Data Science on Hadoop
Intro to Data Science on Hadoop
Caserta
 
Building a Big Data Analytics Platform- Impetus White Paper
Building a Big Data Analytics Platform- Impetus White PaperBuilding a Big Data Analytics Platform- Impetus White Paper
Building a Big Data Analytics Platform- Impetus White Paper
Impetus Technologies
 
Making ‘Big Data’ Your Ally – Using data analytics to improve compliance, due...
Making ‘Big Data’ Your Ally – Using data analytics to improve compliance, due...Making ‘Big Data’ Your Ally – Using data analytics to improve compliance, due...
Making ‘Big Data’ Your Ally – Using data analytics to improve compliance, due...emermell
 
Data Lakes versus Data Warehouses
Data Lakes versus Data WarehousesData Lakes versus Data Warehouses
Data Lakes versus Data Warehouses
Tom Donoghue
 
Data Modeling for Big Data
Data Modeling for Big DataData Modeling for Big Data
Data Modeling for Big Data
DATAVERSITY
 
Better Architecture for Data: Adaptable, Scalable, and Smart
Better Architecture for Data: Adaptable, Scalable, and SmartBetter Architecture for Data: Adaptable, Scalable, and Smart
Better Architecture for Data: Adaptable, Scalable, and Smart
Paul Boal
 
Big data and Hadoop overview
Big data and Hadoop overviewBig data and Hadoop overview
Big data and Hadoop overview
Nitesh Ghosh
 
Hadoop(Term Paper)
Hadoop(Term Paper)Hadoop(Term Paper)
Hadoop(Term Paper)
Dux Chandegra
 
Hadoop Data Modeling
Hadoop Data ModelingHadoop Data Modeling
Hadoop Data Modeling
Adam Doyle
 
Big Data's Impact on the Enterprise
Big Data's Impact on the EnterpriseBig Data's Impact on the Enterprise
Big Data's Impact on the Enterprise
Caserta
 
Big dataplatform operationalstrategy
Big dataplatform operationalstrategyBig dataplatform operationalstrategy
Big dataplatform operationalstrategy
Himanshu Bari
 
Big data analysis concepts and references
Big data analysis concepts and referencesBig data analysis concepts and references
Big data analysis concepts and references
Information Security Awareness Group
 
How to Quickly and Easily Draw Value from Big Data Sources_Q3 symposia(Moa)
How to Quickly and Easily Draw Value  from Big Data Sources_Q3 symposia(Moa)How to Quickly and Easily Draw Value  from Big Data Sources_Q3 symposia(Moa)
How to Quickly and Easily Draw Value from Big Data Sources_Q3 symposia(Moa)Moacyr Passador
 
Assumptions about Data and Analysis: Briefing room webcast slides
Assumptions about Data and Analysis: Briefing room webcast slidesAssumptions about Data and Analysis: Briefing room webcast slides
Assumptions about Data and Analysis: Briefing room webcast slides
mark madsen
 
Business intelligence and data warehousing
Business intelligence and data warehousingBusiness intelligence and data warehousing
Business intelligence and data warehousing
OZ Assignment help
 
DOAG Big Data Days 2017 - Cloud Journey
DOAG Big Data Days 2017 - Cloud JourneyDOAG Big Data Days 2017 - Cloud Journey
DOAG Big Data Days 2017 - Cloud Journey
Harald Erb
 
Moving Past Infrastructure Limitations
Moving Past Infrastructure LimitationsMoving Past Infrastructure Limitations
Moving Past Infrastructure Limitations
Caserta
 

What's hot (20)

Data Lake,beyond the Data Warehouse
Data Lake,beyond the Data WarehouseData Lake,beyond the Data Warehouse
Data Lake,beyond the Data Warehouse
 
Making Big Data Easy for Everyone
Making Big Data Easy for EveryoneMaking Big Data Easy for Everyone
Making Big Data Easy for Everyone
 
Security issues in big data
Security issues in big data Security issues in big data
Security issues in big data
 
Intro to Data Science on Hadoop
Intro to Data Science on HadoopIntro to Data Science on Hadoop
Intro to Data Science on Hadoop
 
Building a Big Data Analytics Platform- Impetus White Paper
Building a Big Data Analytics Platform- Impetus White PaperBuilding a Big Data Analytics Platform- Impetus White Paper
Building a Big Data Analytics Platform- Impetus White Paper
 
Making ‘Big Data’ Your Ally – Using data analytics to improve compliance, due...
Making ‘Big Data’ Your Ally – Using data analytics to improve compliance, due...Making ‘Big Data’ Your Ally – Using data analytics to improve compliance, due...
Making ‘Big Data’ Your Ally – Using data analytics to improve compliance, due...
 
Data Lakes versus Data Warehouses
Data Lakes versus Data WarehousesData Lakes versus Data Warehouses
Data Lakes versus Data Warehouses
 
Data Modeling for Big Data
Data Modeling for Big DataData Modeling for Big Data
Data Modeling for Big Data
 
Better Architecture for Data: Adaptable, Scalable, and Smart
Better Architecture for Data: Adaptable, Scalable, and SmartBetter Architecture for Data: Adaptable, Scalable, and Smart
Better Architecture for Data: Adaptable, Scalable, and Smart
 
Big data and Hadoop overview
Big data and Hadoop overviewBig data and Hadoop overview
Big data and Hadoop overview
 
Hadoop(Term Paper)
Hadoop(Term Paper)Hadoop(Term Paper)
Hadoop(Term Paper)
 
Hadoop Data Modeling
Hadoop Data ModelingHadoop Data Modeling
Hadoop Data Modeling
 
Big Data's Impact on the Enterprise
Big Data's Impact on the EnterpriseBig Data's Impact on the Enterprise
Big Data's Impact on the Enterprise
 
Big dataplatform operationalstrategy
Big dataplatform operationalstrategyBig dataplatform operationalstrategy
Big dataplatform operationalstrategy
 
Big data analysis concepts and references
Big data analysis concepts and referencesBig data analysis concepts and references
Big data analysis concepts and references
 
How to Quickly and Easily Draw Value from Big Data Sources_Q3 symposia(Moa)
How to Quickly and Easily Draw Value  from Big Data Sources_Q3 symposia(Moa)How to Quickly and Easily Draw Value  from Big Data Sources_Q3 symposia(Moa)
How to Quickly and Easily Draw Value from Big Data Sources_Q3 symposia(Moa)
 
Assumptions about Data and Analysis: Briefing room webcast slides
Assumptions about Data and Analysis: Briefing room webcast slidesAssumptions about Data and Analysis: Briefing room webcast slides
Assumptions about Data and Analysis: Briefing room webcast slides
 
Business intelligence and data warehousing
Business intelligence and data warehousingBusiness intelligence and data warehousing
Business intelligence and data warehousing
 
DOAG Big Data Days 2017 - Cloud Journey
DOAG Big Data Days 2017 - Cloud JourneyDOAG Big Data Days 2017 - Cloud Journey
DOAG Big Data Days 2017 - Cloud Journey
 
Moving Past Infrastructure Limitations
Moving Past Infrastructure LimitationsMoving Past Infrastructure Limitations
Moving Past Infrastructure Limitations
 

Viewers also liked

Architecting a Scalable Hadoop Platform: Top 10 considerations for success
Architecting a Scalable Hadoop Platform: Top 10 considerations for successArchitecting a Scalable Hadoop Platform: Top 10 considerations for success
Architecting a Scalable Hadoop Platform: Top 10 considerations for success
DataWorks Summit
 
Designing Fast Data Architecture for Big Data using Logical Data Warehouse a...
Designing Fast Data Architecture for Big Data  using Logical Data Warehouse a...Designing Fast Data Architecture for Big Data  using Logical Data Warehouse a...
Designing Fast Data Architecture for Big Data using Logical Data Warehouse a...
Denodo
 
Building Reactive Fast Data & the Data Lake with Akka, Kafka, Spark
Building Reactive Fast Data & the Data Lake with Akka, Kafka, SparkBuilding Reactive Fast Data & the Data Lake with Akka, Kafka, Spark
Building Reactive Fast Data & the Data Lake with Akka, Kafka, Spark
Todd Fritz
 
Building the Data Lake with Azure Data Factory and Data Lake Analytics
Building the Data Lake with Azure Data Factory and Data Lake AnalyticsBuilding the Data Lake with Azure Data Factory and Data Lake Analytics
Building the Data Lake with Azure Data Factory and Data Lake Analytics
Khalid Salama
 
Reactive Fast Data & the Data Lake with Akka, Kafka, Spark
Reactive Fast Data & the Data Lake with Akka, Kafka, SparkReactive Fast Data & the Data Lake with Akka, Kafka, Spark
Reactive Fast Data & the Data Lake with Akka, Kafka, Spark
Todd Fritz
 
Best Practices for Building a Data Lake with Amazon S3 - August 2016 Monthly ...
Best Practices for Building a Data Lake with Amazon S3 - August 2016 Monthly ...Best Practices for Building a Data Lake with Amazon S3 - August 2016 Monthly ...
Best Practices for Building a Data Lake with Amazon S3 - August 2016 Monthly ...
Amazon Web Services
 
(BDT317) Building A Data Lake On AWS
(BDT317) Building A Data Lake On AWS(BDT317) Building A Data Lake On AWS
(BDT317) Building A Data Lake On AWS
Amazon Web Services
 
Implementing a Data Lake with Enterprise Grade Data Governance
Implementing a Data Lake with Enterprise Grade Data GovernanceImplementing a Data Lake with Enterprise Grade Data Governance
Implementing a Data Lake with Enterprise Grade Data Governance
Hortonworks
 
Building a Server-less Data Lake on AWS - Technical 301
Building a Server-less Data Lake on AWS - Technical 301Building a Server-less Data Lake on AWS - Technical 301
Building a Server-less Data Lake on AWS - Technical 301
Amazon Web Services
 
Big data architectures and the data lake
Big data architectures and the data lakeBig data architectures and the data lake
Big data architectures and the data lake
James Serra
 
(BDT310) Big Data Architectural Patterns and Best Practices on AWS
(BDT310) Big Data Architectural Patterns and Best Practices on AWS(BDT310) Big Data Architectural Patterns and Best Practices on AWS
(BDT310) Big Data Architectural Patterns and Best Practices on AWS
Amazon Web Services
 

Viewers also liked (11)

Architecting a Scalable Hadoop Platform: Top 10 considerations for success
Architecting a Scalable Hadoop Platform: Top 10 considerations for successArchitecting a Scalable Hadoop Platform: Top 10 considerations for success
Architecting a Scalable Hadoop Platform: Top 10 considerations for success
 
Designing Fast Data Architecture for Big Data using Logical Data Warehouse a...
Designing Fast Data Architecture for Big Data  using Logical Data Warehouse a...Designing Fast Data Architecture for Big Data  using Logical Data Warehouse a...
Designing Fast Data Architecture for Big Data using Logical Data Warehouse a...
 
Building Reactive Fast Data & the Data Lake with Akka, Kafka, Spark
Building Reactive Fast Data & the Data Lake with Akka, Kafka, SparkBuilding Reactive Fast Data & the Data Lake with Akka, Kafka, Spark
Building Reactive Fast Data & the Data Lake with Akka, Kafka, Spark
 
Building the Data Lake with Azure Data Factory and Data Lake Analytics
Building the Data Lake with Azure Data Factory and Data Lake AnalyticsBuilding the Data Lake with Azure Data Factory and Data Lake Analytics
Building the Data Lake with Azure Data Factory and Data Lake Analytics
 
Reactive Fast Data & the Data Lake with Akka, Kafka, Spark
Reactive Fast Data & the Data Lake with Akka, Kafka, SparkReactive Fast Data & the Data Lake with Akka, Kafka, Spark
Reactive Fast Data & the Data Lake with Akka, Kafka, Spark
 
Best Practices for Building a Data Lake with Amazon S3 - August 2016 Monthly ...
Best Practices for Building a Data Lake with Amazon S3 - August 2016 Monthly ...Best Practices for Building a Data Lake with Amazon S3 - August 2016 Monthly ...
Best Practices for Building a Data Lake with Amazon S3 - August 2016 Monthly ...
 
(BDT317) Building A Data Lake On AWS
(BDT317) Building A Data Lake On AWS(BDT317) Building A Data Lake On AWS
(BDT317) Building A Data Lake On AWS
 
Implementing a Data Lake with Enterprise Grade Data Governance
Implementing a Data Lake with Enterprise Grade Data GovernanceImplementing a Data Lake with Enterprise Grade Data Governance
Implementing a Data Lake with Enterprise Grade Data Governance
 
Building a Server-less Data Lake on AWS - Technical 301
Building a Server-less Data Lake on AWS - Technical 301Building a Server-less Data Lake on AWS - Technical 301
Building a Server-less Data Lake on AWS - Technical 301
 
Big data architectures and the data lake
Big data architectures and the data lakeBig data architectures and the data lake
Big data architectures and the data lake
 
(BDT310) Big Data Architectural Patterns and Best Practices on AWS
(BDT310) Big Data Architectural Patterns and Best Practices on AWS(BDT310) Big Data Architectural Patterns and Best Practices on AWS
(BDT310) Big Data Architectural Patterns and Best Practices on AWS
 

Similar to Whitepaper-The-Data-Lake-3_0

Gerenral insurance Accounts IT and Investment
Gerenral insurance Accounts IT and InvestmentGerenral insurance Accounts IT and Investment
Gerenral insurance Accounts IT and Investment
vijayk23x
 
Enterprise Data Lake
Enterprise Data LakeEnterprise Data Lake
Enterprise Data Lake
sambiswal
 
Using Data Lakes to Sail Through Your Sales Goals
Using Data Lakes to Sail Through Your Sales GoalsUsing Data Lakes to Sail Through Your Sales Goals
Using Data Lakes to Sail Through Your Sales Goals
IrshadKhan682442
 
Using Data Lakes to Sail Through Your Sales Goals
Using Data Lakes to Sail Through Your Sales GoalsUsing Data Lakes to Sail Through Your Sales Goals
Using Data Lakes to Sail Through Your Sales Goals
WilliamJohnson288536
 
Using Data Lakes To Sail Through Your Sales Goals
Using Data Lakes To Sail Through Your Sales GoalsUsing Data Lakes To Sail Through Your Sales Goals
Using Data Lakes To Sail Through Your Sales Goals
KevinJohnson667312
 
Top 60+ Data Warehouse Interview Questions and Answers.pdf
Top 60+ Data Warehouse Interview Questions and Answers.pdfTop 60+ Data Warehouse Interview Questions and Answers.pdf
Top 60+ Data Warehouse Interview Questions and Answers.pdf
Datacademy.ai
 
Oracle sql plsql & dw
Oracle sql plsql & dwOracle sql plsql & dw
Oracle sql plsql & dw
Sateesh Kumar Sarvasiddi
 
Data Science.pdf
Data Science.pdfData Science.pdf
Data Science.pdf
Umar khan
 
the process of transforming data into in
the process of transforming data into inthe process of transforming data into in
the process of transforming data into in
NISHANTHM64
 
Modern Integrated Data Environment - Whitepaper | Qubole
Modern Integrated Data Environment - Whitepaper | QuboleModern Integrated Data Environment - Whitepaper | Qubole
Modern Integrated Data Environment - Whitepaper | Qubole
Vasu S
 
data warehouse vs data lake
data warehouse vs data lakedata warehouse vs data lake
data warehouse vs data lake
Polestarsolutions
 
Optimising Data Lakes for Financial Services
Optimising Data Lakes for Financial ServicesOptimising Data Lakes for Financial Services
Optimising Data Lakes for Financial Services
Andrew Carr
 
Big data data lake and beyond
Big data data lake and beyond Big data data lake and beyond
Big data data lake and beyond
Rajesh Kumar
 
Beyond the Basics - Evolving Trends in Data Storage Strategies.pdf
Beyond the Basics - Evolving Trends in Data Storage Strategies.pdfBeyond the Basics - Evolving Trends in Data Storage Strategies.pdf
Beyond the Basics - Evolving Trends in Data Storage Strategies.pdf
kelyn Technology
 
Difference between Database vs Data Warehouse vs Data Lake
Difference between Database vs Data Warehouse vs Data LakeDifference between Database vs Data Warehouse vs Data Lake
Difference between Database vs Data Warehouse vs Data Lake
jeetendra mandal
 
final-the-data-teams-guide-to-the-db-lakehouse-platform-rd-6-14-22.pdf
final-the-data-teams-guide-to-the-db-lakehouse-platform-rd-6-14-22.pdffinal-the-data-teams-guide-to-the-db-lakehouse-platform-rd-6-14-22.pdf
final-the-data-teams-guide-to-the-db-lakehouse-platform-rd-6-14-22.pdf
XIAOZEJIN1
 
Big Data and BI Tools - BI Reporting for Bay Area Startups User Group
Big Data and BI Tools - BI Reporting for Bay Area Startups User GroupBig Data and BI Tools - BI Reporting for Bay Area Startups User Group
Big Data and BI Tools - BI Reporting for Bay Area Startups User Group
Scott Mitchell
 
Big data analytics - Introduction to Big Data and Hadoop
Big data analytics - Introduction to Big Data and HadoopBig data analytics - Introduction to Big Data and Hadoop
Big data analytics - Introduction to Big Data and Hadoop
SamiraChandan
 
How 3 trends are shaping analytics and data management
How 3 trends are shaping analytics and data management How 3 trends are shaping analytics and data management
How 3 trends are shaping analytics and data management
Abhishek Sood
 

Similar to Whitepaper-The-Data-Lake-3_0 (20)

Gerenral insurance Accounts IT and Investment
Gerenral insurance Accounts IT and InvestmentGerenral insurance Accounts IT and Investment
Gerenral insurance Accounts IT and Investment
 
Big data rmoug
Big data rmougBig data rmoug
Big data rmoug
 
Enterprise Data Lake
Enterprise Data LakeEnterprise Data Lake
Enterprise Data Lake
 
Using Data Lakes to Sail Through Your Sales Goals
Using Data Lakes to Sail Through Your Sales GoalsUsing Data Lakes to Sail Through Your Sales Goals
Using Data Lakes to Sail Through Your Sales Goals
 
Using Data Lakes to Sail Through Your Sales Goals
Using Data Lakes to Sail Through Your Sales GoalsUsing Data Lakes to Sail Through Your Sales Goals
Using Data Lakes to Sail Through Your Sales Goals
 
Using Data Lakes To Sail Through Your Sales Goals
Using Data Lakes To Sail Through Your Sales GoalsUsing Data Lakes To Sail Through Your Sales Goals
Using Data Lakes To Sail Through Your Sales Goals
 
Top 60+ Data Warehouse Interview Questions and Answers.pdf
Top 60+ Data Warehouse Interview Questions and Answers.pdfTop 60+ Data Warehouse Interview Questions and Answers.pdf
Top 60+ Data Warehouse Interview Questions and Answers.pdf
 
Oracle sql plsql & dw
Oracle sql plsql & dwOracle sql plsql & dw
Oracle sql plsql & dw
 
Data Science.pdf
Data Science.pdfData Science.pdf
Data Science.pdf
 
the process of transforming data into in
the process of transforming data into inthe process of transforming data into in
the process of transforming data into in
 
Modern Integrated Data Environment - Whitepaper | Qubole
Modern Integrated Data Environment - Whitepaper | QuboleModern Integrated Data Environment - Whitepaper | Qubole
Modern Integrated Data Environment - Whitepaper | Qubole
 
data warehouse vs data lake
data warehouse vs data lakedata warehouse vs data lake
data warehouse vs data lake
 
Optimising Data Lakes for Financial Services
Optimising Data Lakes for Financial ServicesOptimising Data Lakes for Financial Services
Optimising Data Lakes for Financial Services
 
Big data data lake and beyond
Big data data lake and beyond Big data data lake and beyond
Big data data lake and beyond
 
Beyond the Basics - Evolving Trends in Data Storage Strategies.pdf
Beyond the Basics - Evolving Trends in Data Storage Strategies.pdfBeyond the Basics - Evolving Trends in Data Storage Strategies.pdf
Beyond the Basics - Evolving Trends in Data Storage Strategies.pdf
 
Difference between Database vs Data Warehouse vs Data Lake
Difference between Database vs Data Warehouse vs Data LakeDifference between Database vs Data Warehouse vs Data Lake
Difference between Database vs Data Warehouse vs Data Lake
 
final-the-data-teams-guide-to-the-db-lakehouse-platform-rd-6-14-22.pdf
final-the-data-teams-guide-to-the-db-lakehouse-platform-rd-6-14-22.pdffinal-the-data-teams-guide-to-the-db-lakehouse-platform-rd-6-14-22.pdf
final-the-data-teams-guide-to-the-db-lakehouse-platform-rd-6-14-22.pdf
 
Big Data and BI Tools - BI Reporting for Bay Area Startups User Group
Big Data and BI Tools - BI Reporting for Bay Area Startups User GroupBig Data and BI Tools - BI Reporting for Bay Area Startups User Group
Big Data and BI Tools - BI Reporting for Bay Area Startups User Group
 
Big data analytics - Introduction to Big Data and Hadoop
Big data analytics - Introduction to Big Data and HadoopBig data analytics - Introduction to Big Data and Hadoop
Big data analytics - Introduction to Big Data and Hadoop
 
How 3 trends are shaping analytics and data management
How 3 trends are shaping analytics and data management How 3 trends are shaping analytics and data management
How 3 trends are shaping analytics and data management
 

Whitepaper-The-Data-Lake-3_0

  • 1. Implementing the Enterprise Data Lake A four-stage approach to building a massive, easily accessible, flexible and scalable Big Data repository What is a Data Lake, and how does it help meet the challenges of Big Data? Will the Enterprise Data Warehouse (EDW) and the Data Lake coexist? If so, how? This paper explores what it takes to get started on the journey toward incorporating a Data Lake into an organization’s architecture. www.impetus.com
  • 2. Data is like money. The world runs on it. We believe it’s valuable, and we are pretty sure we can’t have too much of it. We save it, store it, move it around in various formats, and use it for all kinds of purposes. We will also take it in just about any form we can get it. We might not know how we’re going to use it, but we’re willing to collect it now and figure out what to do with it later. For the most part, we’re happy as long as it just keeps on coming in. At least, that’s been true when we thought of data as finite. But, with the advent of Big Data, it’s pouring in like never before - and while we still want it in any form we can get it, structured or unstructured, the issues of storing, managing, and analyzing it are becoming more complex. Interestingly, despite all the advances of technology, the money that data most closely resembles isn’t the conceptual kind that we refer to when we say it’s “on paper” or that is being traded in nano-seconds in financial markets, but rather, data is more like cold hard cash. It’s essentially a physical thing - heavy, a bit cumbersome and hard to move. Large data transfers can take days. You don’t want to move it very frequently, and you don’t want to move it very far. And if you do have to move it, you’d prefer to transport it as safely as possibly, maybe via armored truck, for example. So what to do with it all? How do we store it? How do we manage it? How do we use it? These are the questions that lead us to the Data Lake. But what is a Data Lake, and how does it help meet the challenges of Big Data? Introduction 2 Defining the Problem Before we talk about what a Data Lake1 is, let’s define the problem and a term or two a little more clearly. First there is unstructured data. While organizations are amassing massive amounts of data, much of it is unstructured. Unstructured data refers to information that either does not have a pre-defined data model or is not organized in a pre-defined manner. And that’s a concern because if it’s not pre- defined or structured, it’s usually difficult to analyze. And if you can’t analyze it, what’s the point? Additionally, structuring it is a laborious, time-consuming task. However, unstructured data accounts for much of the explosion that is Big Data. It is also widely understood as holding the most promise for gaining new, actionable insights. Nearly all the data that lives outside of databases is unstructured, including images, videos and log files produced by computers, machines and sensors. Even this document is unstructured data. The sheer volume of it is staggering; unstructured data makes up at least 80 percent of all digitally stored data. And as the data-driven economy grows, the amount of unstructured data only grows. So, what to do? Enter the Data Lake. The power of Big Data is the ability to correlate data. Real-time enterprise data analytics are really all about improving decision making. 1 “Data Lake” is one of several interchangeable common terms that could have been used here. Some others are Big Data repository, unified data architecture, and modern data architecture.
  • 3. 3 What is a Data Lake? A “Data Lake” is one of several interchangeable terms that are commonly used. Some others are Big Data repository, unified data architecture, modern data architecture, as well as others. No matter what it is called, the concept is the same: take the data that’s coming in -- in unstructured torrents -- and store it where it’s more accessible, flexible and scalable and able to be analyzed without the need to structure it. Here are two common definitions: • A Data Lake is a massive, easily accessible, flexible and scalable data repository • A Data Lake is an enterprise-wide data management platform for analyzing disparate sources of data in their native format Data Lakes include structured, semi-structured, and unstructured data. They are built on inexpensive computer hardware and are designed for storing uncategorized pools of data, including the following: • Data immediately of interest • Data potentially of interest • Data for which the intended usage is not yet known The information in the Data Lake is consolidated - both structured and unstructured - in a manner that allows inquiries to be made across the entire body of data all at once. This ability to access all of the data is especially appealing because the true power of Big Data is the ability to correlate insights across previously siloed data warehouses or between structured and unstructured data sources. Drivers for the Data Lake Real-time enterprise data analytics are all about improving decision making. With so much data traditionally siloed into different data warehouses, such as Enterprise Resource Planning (ERP), Customer Relationship Management (CRM), Human Resource Management (HCM), and others, it’s almost impossible to make correlations across these somewhat captive data sources. The thinking now is to integrate data silos, build infrastructures that empower data science to improve analytics, and reduce time to market by faster analytical processing. These are some of the drivers behind the new architecture that is a Data Lake. Limitations of the Current Enterprise Data Warehouse? Why can’t we just use what we’ve always used? For the last several decades, EDWs have served as the foundation for business intelligence and data discovery. The world of data warehousing was a more predictable world, a world where structures and formatting could take place in advance, where hypotheses were drawn, where the content of data was known, and where the scope was restricted and pre-defined. Thus, the metaphor of a warehouse worked well because, like a warehouse, one could organize data the way one might stock shelves.
  • 4. 4 Limitations of the Current Enterprise Data Warehouse? Why can’t we just use what we’ve always used? For the last several decades, EDWs have served as the foundation for business intelligence and data discovery. The world of data warehousing was a more predictable world, a world where structures and formatting could take place in advance, where hypotheses were drawn, where the content of data was known, and where the scope was restricted and pre-defined. Thus, the metaphor of a warehouse worked well because, like a warehouse, one could organize data the way one might stock shelves. In the real world of actual shelf-stocking, there are obviously some significant constraints related to physical space and cost. For example, if you were running a logistics company or a retail company, you’d need a physical structure as well as shelves and floor space to store all your palettes and boxes. You’d need a plan for what and where to store your inventory as well as labels for everything so that you could efficiently organize the space for shipping and managing your goods. In the world of data storage, the constraints are the same as physical inventory: it costs to store and it costs to move. However, part of the complexity in the realm of Big Data is that we no longer know what’s in the metaphorical boxes, let alone how we’re going to use it. EDWs are not only costly but they are not structured to handle the complexities of Big Data. Warehouses work when you can define what’s in the boxes and all the associated logistics. With Big Data, that’s no longer possible. Thus, what makes the EDW great is also what restricts it. Data warehouses store data in specific static structures and categories that dictate the kind of analysis that is possible on that data. With the emergence of Big Data, this approach falls short because it’s impossible to determine what the data might hold. And in cases where analysis is required in real-time, formatting in advance is not an option. The point here is that the world of data has become fluid, not static. And data is available in such massive volumes, at near real-time velocity and in its many unstructured forms. Real data discovery requires that analysts are able to ask questions of the data as train-of-thought demands. The real questions only emerge during the process of the analysis itself which is not easily done in the EDW world. What’s needed is an approach that allows business users to siphon off or distill the information they need as they need it. This is the shift that underpins the business Data Lake and which changes the game to something that better meets the needs of today’s responsive, real-time enterprise. Capabilities of the Data Lake What capabilities does the Data Lake bring to the enterprise? What are the capabilities that didn’t exist prior to the Data Lake? Here’s our list of the top four:
Active Archive: Providing Access to Historic Data

An active archive provides a single place to store all your data, in any format, at any volume, indefinitely. Enterprise data governance policies - and in many cases, federal law - dictate how data must be managed, including how long it must be retained. An active archive allows you to address these compliance requirements and deliver data on demand to satisfy internal and external regulatory demands. Because it is secure, you control who sees what; because it delivers governance and lineage services, you can trace the access and evolution of your data over time. Having access to historic data - both raw source information and data archived from conventional relational stores - is extremely valuable in use cases where data must be delivered on demand, such as health records that must be kept for a certain period or financial records that must be retained for regulatory compliance. This capability also enables immediate insight instead of long, drawn-out retrieval processes.

Self-Service Exploratory Business Intelligence

In many ways, stored data can be the best currency an organization has to offer. But like all other investments, this comes at a price: organizations must dedicate money and time to protecting their data. Users frequently want access to enterprise data for reporting, exploration, and analysis, but production enterprise data warehouse systems often need to be shielded from casual use so they can run the mission-critical financial and operational workloads they support. An enterprise Data Lake allows users to explore data, with full security, using traditional interactive business intelligence tools via SQL and keyword search.

Advanced Analytics: Far Beyond Data Sampling

A secret of many data analysis projects is that calculations are based on representative samples of the data rather than full sets. While this works nicely if you're trying to determine whether an Oscar-nominated film is likely to win an Academy Award based on its popularity compared to other nominees, what if you are a researcher at the Centers for Disease Control trying to determine the cause of an outbreak, an investment banker trying to measure risk, or a retailer wanting to understand customer motivations across channels? The bottom line is that you are much better off with the ability to search and analyze data at large scale and at a granular level, rather than just sampling it. Data Lakes provide that level of fidelity.

Low Cost of Transformation: Optimizing Workloads

Extract, Transform and Load (ETL) workloads that previously ran on expensive systems can migrate to the enterprise Data Lake, where they run at very low cost, in parallel, and much faster than before. Optimizing the placement of these workloads frees capacity on high-end analytic and data warehouse systems, making them more valuable by allowing them to concentrate on the business-critical applications they support. A minimal sketch of such an offloaded ETL job appears below.
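To make the workload-offload idea concrete, here is a minimal PySpark sketch of the kind of ETL job that might migrate from an expensive relational system to the Data Lake. The file paths, column names and schema are hypothetical illustrations, not a reference to any particular deployment.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Hypothetical example: a nightly ETL job offloaded from an expensive
# relational system to the Data Lake. Paths and columns are placeholders.
spark = SparkSession.builder.appName("etl-offload-sketch").getOrCreate()

# Read raw, unmodeled source files landed in HDFS (schema-on-read).
orders = spark.read.json("hdfs:///lake/raw/orders/2015-06-01/")

# The same cleanse-and-aggregate logic that once ran on the warehouse,
# expressed as parallel DataFrame transformations.
daily_revenue = (
    orders
    .filter(F.col("status") == "COMPLETED")
    .withColumn("order_date", F.to_date(F.col("order_ts")))
    .groupBy("order_date", "region")
    .agg(F.sum("amount").alias("revenue"),
         F.countDistinct("customer_id").alias("customers"))
)

# Write the conformed result back to the lake in a columnar format,
# partitioned for cheap downstream queries.
daily_revenue.write.mode("overwrite") \
    .partitionBy("order_date") \
    .parquet("hdfs:///lake/curated/daily_revenue/")

spark.stop()
```

Jobs like this run in parallel across the cluster, which is what frees capacity on the high-end warehouse systems described above.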
Adjunct to the EDW

With this newfound wealth of data, we're also experiencing a cultural shift toward the democratization of data. Leading organizations are now saying, "Here, have some. Let's let everybody have access and see what they can do with it." This reflects a growing recognition that the more an organization can harness its information, the greater the value it derives from deeper insights. For this reason, organizations are removing blocks to innovation and transforming the way data contributes to success. Serving as an adjunct to the EDW, a Data Lake can:

• Work in tandem with the EDW, allowing you to offload colder data to the Data Lake
• Allow you to work with unstructured data
• Support a cultural shift toward democratized data access
• Contain costs while continuing to do more with more data

Sounds compelling, doesn't it? But how do you know if you are ready for a Data Lake?

Determining Readiness: Some Questions to Ask

Here are some of the critical drivers that indicate readiness:

• Are you working with a growing amount of unstructured data?
• Are your lines of business demanding even more unstructured data?
• Does your organization need a unified view of its information?
• Do you need to perform real-time analysis on source data?
• Is your organization moving toward a culture of democratized data access?
• Would your organization benefit from elasticity of scale?

Design Principles

If you are ready for a Data Lake, there are some key design principles we recommend following. Here are our top five:

• Discovery without limitations
• Low latency at any scale
• Movement from a reactive model to a predictive model
• Elasticity in infrastructure
• Affordability

One of the most significant reasons to build a Data Lake is to encourage experimentation and to move from an intuition-based model to a more comprehensive, empirical, data-science-driven model. For that kind of experimentation and analytical finesse to thrive, you have to allow discovery without limitations - that is, you must be willing and able to give users access to all of that data. You should also be able to perform low-latency queries over data at any scale. Now let's talk about what it takes to build a Data Lake.
Not a Big Bang Approach: The Four Stages of Building a Data Lake

In our experience, building a Data Lake doesn't happen all at once; rather, there are stages of maturity:

1. Handling and ingesting data at scale
2. Building analytical muscle
3. Leveraging the strengths of the EDW and the Data Lake
4. Adopting broadly

1. Handling and Ingesting Data at Scale

The first stage involves getting the architecture in place and learning to acquire and transform data at scale. This is when your organization determines the new and existing data sources it can leverage. These sources are then integrated, and data of high volume and variety is ingested at high velocity into Hadoop storage. At this stage the analytics may be rather simple - perhaps just basic transformations - but it is an important step in discovering how to make Hadoop work the way you want.

2. Building Analytical Muscle

The second stage focuses on improving the ability to transform and analyze data. This is where you begin to really leverage the enterprise Data Lake. For example, your organization can start building batch, mini-batch, and real-time applications for enterprise usage, exploratory analytics, and predictive use cases. Various tools and frameworks come into play at this stage, and the EDW and the Data Lake start working together.

3. Leveraging the Strengths of the EDW and the Data Lake

This is when the orchestra really starts to play. In the third stage, you will want to get data and analytics into the hands of as many people in the organization as possible: democratization begins. This is also the stage at which the EDW and the Hadoop-based Big Data lake truly coexist, allowing the enterprise to leverage the strengths of each architecture.

4. Adopting Broadly

The fourth stage is the highest level of maturity, when enterprise capabilities are added to the Data Lake. Broad adoption of a unified Data Lake architecture requires information governance, compliance, security, auditing, metadata management and information lifecycle management capabilities. Failing to address these issues can slow enterprise adoption and runs the risk that the Data Lake eventually becomes a "data swamp."

The Big Data Lake: Understanding the Essentials

Understanding the layers of the Big Data warehouse is an essential step in the Big Data journey. The following pages describe the components of a Big Data warehouse and the methodology for setting it up.

Components of the Big Data Warehouse

While requirements and specific business needs vary from organization to organization, the following diagram lists the major components of a Big Data warehouse.
Figure 1: Big Data warehouse reference architecture by Impetus (data sources feeding ingestion via Sqoop, Kafka/Flume and existing DI tools; a data query and relational offload engine built on Hive, Pig, Drill or Spark SQL; NoSQL data stores; access via search, pipelines and cubes; plus virtualization, governance, management and business intelligence layers)

Data Sources

An enterprise usually has the following sources of data:

• Relational databases such as Oracle, DB2, PostgreSQL, SQL Server and the like
• Multiple disparate, unstructured and semi-structured data sources, with data in formats such as flat files, XML, JSON or CSV
• Existing systems that provide integration data in EDI or other B2B exchange formats
• Machine data and network elements generating huge volumes of data

Hadoop Distribution

Hadoop is the most popular choice for Big Data today and is available in both open source Apache and commercial distribution packages. Hadoop includes a file system called HDFS (Hadoop Distributed File System), which forms the key data storage layer of the Big Data warehouse. Other storage options are also available, such as GPFS (from IBM) and S3 (from the Amazon cloud).

Data Ingestion

It is imperative to set up reliable and scalable ingestion mechanisms to bring data from the sources into the Hadoop file system (a minimal streaming sketch follows this list):

• For connecting to relational databases, the most popular options are Sqoop and database-specific connectors
• For streaming data, Apache Kafka and Flume are quite popular
• Organizations that need an entire topology of streaming sources, ingestion, in-flight transformation and data persistence would use one of the common CEP (Complex Event Processing) or streaming engines, such as Apache Storm or StreamAnalytix
• Organizations that want to leverage their existing Data Integration (DI) connectors may need custom scripts that integrate using REST, SOAP or JDBC components
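Each of the engines named above has its own API; as a neutral illustration of the same landing pattern, here is a minimal Spark Structured Streaming sketch that continuously writes a Kafka topic into HDFS. The broker address, topic name and paths are hypothetical, and the job assumes the Spark-Kafka connector package is available on the cluster.

```python
from pyspark.sql import SparkSession

# Hypothetical example: continuously land a Kafka topic into the lake.
# Broker addresses, topic and paths are illustrative placeholders.
spark = SparkSession.builder.appName("streaming-ingest-sketch").getOrCreate()

# Subscribe to the raw events topic.
events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092")
          .option("subscribe", "raw-events")
          .load())

# Keep the payload as-is (schema-on-read) and persist it to the lake's
# landing zone, with checkpointing for reliable file output.
query = (events.selectExpr("CAST(value AS STRING) AS payload", "timestamp")
         .writeStream
         .format("parquet")
         .option("path", "hdfs:///lake/landing/raw-events/")
         .option("checkpointLocation", "hdfs:///lake/checkpoints/raw-events/")
         .start())

query.awaitTermination()
```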
Data Query

For data resident in HDFS, there is a multitude of query engines to choose from, including Pig, Hive, Apache Drill and Spark SQL. Many organizations, however, would prefer to re-use the SQL scripts and procedures written for their traditional enterprise data warehouse. Because they have already invested millions of dollars in traditional SQL and PL/SQL engines, it is understandable that organizations want mechanisms that allow them to offload data tables from relational data warehouses to a Big Data warehouse while keeping their querying and reporting scripts intact. Tools and solutions are now available from organizations like Impetus Technologies that help enterprises offload this expensive computing from relational data warehouses to Big Data warehouses without re-writing the entire processing layer. (A minimal sketch of a reused report query appears at the end of this section.)

Data Stores

Along with HDFS, there is a trend to couple a data store or NoSQL database, such as HBase or Cassandra, with the Big Data warehouse. These stores provide additional functions in the form of columnar storage, schema-less storage, richer querying, OLAP/OLTP support and application integration.

Access

With data stored in the HDFS or NoSQL layer, organizations have increasingly complex access requirements. These include features from the traditional world, such as search and cube functions. There are also new tools that help manage complex pipelines of jobs, where the output of one query may be fed as input into another.

Governance

Ensuring data quality is the key reason for data governance in the Big Data warehouse:

• While the aim of the Big Data warehouse is to offer a Data Lake integrated with all enterprise data sources, it is still essential to apply data quality rules so the Data Lake does not turn into a data swamp
• Similarly, data users increasingly need to be able to manage data through its entire lifecycle
• Classifying data by segments such as business user group (for instance, marketing, risk management, or operations) ensures control and governance of the data
• It is also imperative to define enterprise-level information policies to avoid breaches and ensure control over the entire data warehouse
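As promised above, here is a minimal, hedged PySpark sketch of reusing a warehouse-style SQL report against tables offloaded to the lake. The paths, view names and columns are hypothetical illustrations; real offload solutions automate far more than this.

```python
from pyspark.sql import SparkSession

# Hypothetical example: run an existing warehouse-style SQL report
# against tables offloaded to the lake. Names are placeholders.
spark = SparkSession.builder.appName("offload-query-sketch").getOrCreate()

# Register offloaded fact and dimension data as queryable views.
spark.read.parquet("hdfs:///lake/offload/sales_fact/") \
    .createOrReplaceTempView("sales_fact")
spark.read.parquet("hdfs:///lake/offload/store_dim/") \
    .createOrReplaceTempView("store_dim")

# The original reporting SQL can often be reused with little or no change.
report = spark.sql("""
    SELECT d.region,
           SUM(f.amount) AS total_sales,
           COUNT(*)      AS transactions
    FROM sales_fact f
    JOIN store_dim d ON f.store_id = d.store_id
    GROUP BY d.region
    ORDER BY total_sales DESC
""")
report.show()
```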
Virtualization

Organizations have found that, despite their best intentions and use cases, they may have to deal with the coexistence of the enterprise data warehouse and the Big Data warehouse for a period of time. To ensure consistent results with appropriate polyglot querying, federation of data and delivery mechanisms are essential.

Management

To provision and monitor the entire cluster, operations teams need handy tools and dashboards for cluster management. It is not uncommon to find engineers diagnosing the performance of MapReduce jobs and queries in their quest for optimal speed and minimal resource consumption. Security is another key aspect of the warehouse, with authentication and role-based authorization behind defined gateways.

Business Intelligence

The goal of the warehouse is to achieve business insights and generate intelligence for the organization. To achieve that objective, business teams need to be empowered with visualization and reporting tools. Data scientists can also help discover patterns in the data using predictive and statistical algorithms and machine data analytics.

Stages for Setting up a Big Data Warehouse

The journey to a Big Data warehouse is a multi-stage process. It requires selecting the right tools, keeping a clear vision, and following a process to lay out an effective and integrated data warehouse. The key stages broadly include the following.

Stage One: Handle and Ingest Data at Scale

In the first stage, the organization determines the existing and new data sources that it can leverage (a minimal batch-ingestion sketch follows the figure below):

• The data sources are integrated, and the variety of voluminous data is ingested at high velocity into Hadoop storage
• The incoming data may arrive in varied formats, ranging from unstructured, structured and streaming data to machine, geospatial and time-series data, and external data sets such as social media data

Figure 2: Handle and ingest data at scale (streaming, unstructured, structured, machine, geospatial, time-series, external and social sources flowing through landing and ingestion into Big Data storage)
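As a minimal sketch of Stage One's batch side - the kind of parallel relational pull that Sqoop performs - here is a hypothetical PySpark JDBC read into the lake's landing zone. The connection details, table, bounds and paths are placeholders, and the appropriate JDBC driver is assumed to be on the classpath.

```python
from pyspark.sql import SparkSession

# Hypothetical example: a Sqoop-style batch pull of one relational table
# into the lake's landing zone. All connection details are placeholders.
spark = SparkSession.builder.appName("batch-ingest-sketch").getOrCreate()

orders = (spark.read.format("jdbc")
          .option("url", "jdbc:postgresql://dbhost:5432/sales")
          .option("dbtable", "public.orders")
          .option("user", "etl_user")
          .option("password", "etl_password")
          # Split the read into parallel partitions, as Sqoop would.
          .option("partitionColumn", "order_id")
          .option("lowerBound", "1")
          .option("upperBound", "10000000")
          .option("numPartitions", "8")
          .load())

# Land the table in columnar form for downstream processing.
orders.write.mode("append").parquet("hdfs:///lake/landing/orders/")
spark.stop()
```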
Stage Two: Build the Analytical Muscle

To leverage the enterprise Data Lake in Hadoop, the organization builds batch, mini-batch and real-time applications for enterprise usage, exploratory analytics and predictive use cases. Various tools and frameworks are utilized in this stage as organizations begin to:

• Explore advanced querying engines, starting with MapReduce and moving on to Apache Spark, Apache Flink and similar engines for interactive results
• Build use cases for both batch and real-time processing using streaming solutions like Apache Storm and StreamAnalytix
• Build analytic applications for enterprise adoption, exploration, discovery and prediction (see the predictive sketch after the Stage Three discussion below)

Figure 3: Build the analytical muscle (real-time, enterprise, exploration-and-discovery and predictive applications layered over landing, ingestion and Big Data storage, with provisioning, workflow, monitoring and security)

Stage Three: Enterprise Data Warehouse and Big Data Warehouse Work in Unison

In a real-world scenario, the enterprise data warehouse (EDW) and the Hadoop-based Big Data warehouse (BDW) co-exist as follows:

• The organization leverages the data and the specific capabilities of each platform to its advantage
• Rather than disposing of the expensive enterprise warehouse, organizations prefer to use it alongside Big Data technologies
• Once a stable and mature Big Data warehouse is achieved, the EDW and BDW work in unison, distributing workloads and offloading to each other as required
• Specialized solutions, such as the Impetus relational offload solution, help organizations save millions of dollars with superior time and schedule benefits

Figure 4: Enterprise data warehouse and Big Data warehouse work in unison (traditional RDBMS and MPP repositories operating alongside the Big Data storage, ingestion and application layers)
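Returning to Stage Two's predictive use cases, here is a minimal sketch of a first predictive application built on curated lake data, using Spark's MLlib. The path, feature columns and label are hypothetical illustrations; the label is assumed to be a numeric 0/1 flag.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

# Hypothetical example: train a simple churn model on curated lake data.
# Paths and column names are illustrative placeholders.
spark = SparkSession.builder.appName("predictive-sketch").getOrCreate()

customers = spark.read.parquet("hdfs:///lake/curated/customer_features/")

# Assemble numeric feature columns into the single vector column
# that Spark ML estimators expect.
assembler = VectorAssembler(
    inputCols=["tenure_months", "monthly_spend", "support_tickets"],
    outputCol="features")
train = assembler.transform(customers).select("features", "churned")

# Fit a logistic regression as a first predictive use case; "churned"
# is assumed to be a numeric 0/1 label column.
model = LogisticRegression(labelCol="churned").fit(train)
print("Training AUC:", model.summary.areaUnderROC)

spark.stop()
```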
Stage Four: Achieving Enterprise Maturity in the Warehouse

A unified data warehouse needs various enterprise-ready capabilities, particularly information governance, metadata management and information lifecycle management. While organizations may begin with basic governance paradigms, as they mature in the journey it becomes essential to adopt more sophisticated practices and policies. Further, users are no longer satisfied with simply exploring data and managing it through its lifespan; organizations need tools and utilities to handle the laborious tasks of discovering data, proving out insights and managing the information lifecycle.

Figure 5: Achieving enterprise maturity in the warehouse (the Stage Three architecture extended with governance, information lifecycle and enterprise metadata management)
Summary

• Success in the business world today depends on high-quality, accessible information that offers actionable insight
• The emergence of the Data Lake is inspired by the need to manage and exploit new types of data
• Organizations need new architectures capable of integrating both structured and unstructured data, managing massive data sets, and delivering real-time analytics
• Hadoop-based Big Data architectures have changed the face of the data warehouse, business intelligence and analytics world forever
• There is growing acceptance of the Data Lake as a cornerstone of an enterprise Big Data strategy
• Enterprise adoption of Big Data architectures is accelerating as a way to enable broad new opportunities across all functions and industries
• Big Data warehouse architectures will complement, rather than replace, the enterprise data warehouses of today
• As the enterprise data warehouse and the Data Lake work in unison, they provide a synergy of capabilities, ultimately allowing analysts to do more with data and derive business results faster

Conclusion

Each new technological leap arrives amid a buzz of excitement and uncertainty, and Big Data technologies - the Hadoop ecosystem in particular - have captured imaginations across the IT landscape. The journey to a Big Data warehouse, however, requires adequate planning and vision combined with robust engineering and technology practices. Smart, conscientious Data Lake development can drive greater value to and from a company's data, while tapping the incredible power of innovation to drive real insight.

Impetus Technologies specializes in this niche area, has unraveled the mysteries surrounding the Big Data warehouse for many customers, and continues to stay on top of the continuously evolving ecosystem. With products, solutions and services expertise available at every stage of the Big Data journey, enterprise adopters can collaborate with niche providers to weigh the options and chart a visionary path in their industry segment. Big Data provides an unparalleled opportunity to turn information into a competitive asset, yielding revenues for business and glory for IT like never before.

About Impetus

Impetus is focused on creating big business impact through Big Data solutions for Fortune 1000 enterprises across multiple verticals. The company brings together a unique mix of software products, consulting services, Data Science capabilities and technology expertise. It offers full life-cycle services for Big Data implementations and real-time streaming analytics, including technology strategy, solution architecture, proof of concept, production implementation and ongoing support. Visit http://impetus.com or write to us at bigdata@impetus.com

© 2015 Impetus Technologies, Inc. All rights reserved. Product and company names mentioned herein may be trademarks of their respective companies. June 2015