Implementing
the Enterprise
Data Lake
A four-stage approach to building a massive,
easily accessible, flexible and scalable Big Data repository
What is a Data Lake, and how does it help meet the challenges of Big Data? Will
the Enterprise Data Warehouse (EDW) and the Data Lake coexist? If so, how?
This paper explores what it takes to get started on the journey toward
incorporating a Data Lake into an organization’s architecture.
www.impetus.com
Introduction
Data is like money. The world runs on it. We believe it’s valuable, and we are pretty
sure we can’t have too much of it. We save it, store it, move it around in various
formats, and use it for all kinds of purposes. We will also take it in just about any
form we can get it. We might not know how we’re going to use it, but we’re willing
to collect it now and figure out what to do with it later. For the most part, we’re
happy as long as it just keeps on coming in.
At least, that’s been true when we thought of data as finite. But, with the advent
of Big Data, it’s pouring in like never before - and while we still want it in any form
we can get it, structured or unstructured, the issues of storing, managing, and
analyzing it are becoming more complex.
Interestingly, despite all the advances of technology, the money that data most
closely resembles isn’t the conceptual kind we refer to when we say it’s “on
paper” or that is traded in nanoseconds in financial markets. Rather, data is
more like cold hard cash. It’s essentially a physical thing - heavy, a bit
cumbersome and hard to move. Large data transfers can take days. You don’t
want to move it very frequently, and you don’t want to move it very far. And if you
do have to move it, you’d prefer to transport it as safely as possible - via
armored truck, for example.
So what to do with it all? How do we store it? How do we manage it? How do we use
it? These are the questions that lead us to the Data Lake. But what is a Data Lake,
and how does it help meet the challenges of Big Data?
Defining the Problem
Before we talk about what a Data Lake is, let’s define the problem and a term
or two a little more clearly. First there is unstructured data. While organizations are
amassing massive amounts of data, much of it is unstructured. Unstructured data
refers to information that either does not have a pre-defined data model or is not
organized in a pre-defined manner. And that’s a concern because if it’s not pre-
defined or structured, it’s usually difficult to analyze. And if you can’t analyze it,
what’s the point? Additionally, structuring it is a laborious, time-consuming task.
However, unstructured data accounts for much of the explosion that is Big Data. It
is also widely understood as holding the most promise for gaining new, actionable
insights.
Nearly all the data that lives outside of databases is unstructured, including
images, videos and log files produced by computers, machines and sensors. Even
this document is unstructured data. The sheer volume of it is staggering;
unstructured data makes up at least 80 percent of all digitally stored data. And as
the data-driven economy grows, the amount of unstructured data will only continue to grow.
So, what to do? Enter the Data Lake.
What is a Data Lake?
A “Data Lake” is one of several interchangeable terms in common use; others
are Big Data repository, unified data architecture, and modern data
architecture. No matter what it is called, the concept is the same:
take the data that’s coming in - in unstructured torrents - and store it where it’s
accessible, flexible and scalable, and where it can be analyzed without the need to
structure it first. Here are two common definitions:
• A Data Lake is a massive, easily accessible, flexible and scalable
data repository
• A Data Lake is an enterprise-wide data management platform for analyzing
disparate sources of data in their native format
Data Lakes include structured, semi-structured, and unstructured data. They are
built on inexpensive commodity hardware and are designed for storing
uncategorized pools of data, including the following:
• Data immediately of interest
• Data potentially of interest
• Data for which the intended usage is not yet known
The information in the Data Lake is consolidated - both structured and
unstructured - in a manner that allows inquiries to be made across the entire body
of data all at once. This ability to access all of the data is especially appealing
because the true power of Big Data is the ability to correlate insights across
previously siloed data warehouses or between structured and unstructured data
sources.
Drivers for the Data Lake
Real-time enterprise data analytics are all about improving decision making. With
so much data traditionally siloed in different data warehouses, such as
Enterprise Resource Planning (ERP), Customer Relationship Management
(CRM), Human Capital Management (HCM), and others, it’s almost impossible to
make correlations across these somewhat captive data sources. The thinking now
is to integrate data silos, build infrastructures that empower data science to
improve analytics, and reduce time to market through faster analytical processing.
These are some of the drivers behind the new architecture that is a Data Lake.
Limitations of the Current Enterprise Data Warehouse?
Why can’t we just use what we’ve always used? For the last several decades,
EDWs have served as the foundation for business intelligence and data
discovery. The world of data warehousing was a more predictable world, a world
where structures and formatting could take place in advance, where hypotheses
were drawn, where the content of data was known, and where the scope was
restricted and pre-defined. Thus, the metaphor of a warehouse
worked well because, like a warehouse, one could organize data the way one
might stock shelves.
In the real world of actual shelf-stocking, there are obviously some significant
constraints related to physical space and cost. For example, if you were running a
logistics company or a retail company, you’d need a physical structure as well as
shelves and floor space to store all your pallets and boxes. You’d need a plan for
what and where to store your inventory as well as labels for everything so that you
could efficiently organize the space for shipping and managing your goods.
In the world of data storage, the constraints are the same as with physical
inventory: it costs to store and it costs to move. However, part of the complexity
in the realm of Big Data is that we no longer know what’s in the metaphorical
boxes, let alone how we’re going to use their contents. EDWs are not only costly,
they are also not structured to handle the complexities of Big Data. Warehouses work when
you can define what’s in the boxes and all the associated logistics. With Big
Data, that’s no longer possible.
Thus, what makes the EDW great is also what restricts it. Data warehouses store
data in specific static structures and categories that dictate the kind of analysis
that is possible on that data. With the emergence of Big Data, this approach falls
short because it’s impossible to determine what the data might hold. And in cases
where analysis is required in real-time, formatting in advance is not an option. The
point here is that the world of data has become fluid, not static. Data now arrives
in massive volumes, at near real-time velocity, and in many unstructured forms.
Real data discovery requires that analysts be able to ask questions of the data as
train-of-thought demands. The real questions only emerge during the process of
analysis itself - something that is not easily done in the EDW world.
What’s needed is an approach that allows business users to siphon off or distill the
information they need as they need it. This is the shift that underpins the business
Data Lake and which changes the game to something that better meets the needs
of today’s responsive, real-time enterprise.
Capabilities of the Data Lake
What capabilities does the Data Lake bring to the enterprise? What are the
capabilities that didn’t exist prior to the Data Lake?
Here’s our list of the top four:
Active Archive: Providing Access to Historic Data
An active archive provides a single place to store all your data, in any format, at any
volume, indefinitely.
Enterprise data governance policies - and, in many cases, federal law - govern the
management of data, including how long data must be retained. An active archive
allows you to address these kinds of compliance requirements and deliver data on
demand to satisfy internal and external regulatory demands. Because it is secure,
you control who sees what; because it delivers governance and lineage services,
you can trace the access and evolution of your data over time.
Having access to historic data - both raw source information and data archived from
conventional relational stores - is extremely valuable in use cases where there are
requirements to deliver data on demand, such as health records that must be
kept for a certain amount of time or financial records that must be retained for
regulatory compliance. This capability is very useful for attaining immediate insights
instead of waiting for long, drawn-out retrieval processes.
Self-Service Exploratory Business Intelligence
In many ways, stored data can be the best currency an organization has to offer.
But like all other investments, this comes at a price, as organizations must
dedicate money and time to protecting their data. Users frequently want access to
enterprise data for reporting, exploration, and analysis. But production enterprise
data warehouse systems often need to be protected from casual use so they can
run the mission-critical financial and operational workloads they support.
An enterprise Data Lake allows users to explore data, with full security, using
traditional interactive business intelligence tools via SQL and keyword search.
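
As a minimal illustration of what such self-service access can look like, the sketch below issues an exploratory SQL query through PyHive, an open-source Python client for HiveServer2. The host, credentials, table, and column names are illustrative assumptions, not details from this paper.

```python
from pyhive import hive  # open-source HiveServer2 client: pip install pyhive

# Connect to a HiveServer2 gateway (host, port, and user are hypothetical)
conn = hive.Connection(host="hive-gateway.example.com", port=10000,
                       username="analyst")
cursor = conn.cursor()

# An exploratory, train-of-thought query over data in the lake
cursor.execute("""
    SELECT region, COUNT(*) AS negative_tickets
    FROM support_tickets
    WHERE sentiment = 'negative'
    GROUP BY region
    ORDER BY negative_tickets DESC
""")

for region, negative_tickets in cursor.fetchall():
    print(region, negative_tickets)
```

The same security controls mentioned above still apply: the analyst sees only the tables and rows their role permits, while the production warehouse remains untouched.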
Advanced Analytics: Far Beyond Data Sampling
A secret of many data analysis projects is that calculations are based on
representative samples of the data rather than full sets. While this works nicely if
you’re trying to determine whether an Oscar-nominated film is likely to win an
Academy Award based on its popularity compared to other nominated films, what if
you are a researcher at the Centers for Disease Control trying to determine the
cause of an outbreak, or an investment banker trying to measure risk, or a retailer
wanting to understand customer motivations across channels? The bottom line is
that you are much better off with the ability to search and analyze data on a large
scale and a granular level, rather than just sampling the data. Data Lakes provide
that level of fidelity.
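
The difference is easy to see in code. The hedged PySpark sketch below - paths and column names are invented for illustration - contrasts a sample-based estimate with a full-set, granular aggregation; rare segments that vanish from a 1 percent sample remain fully represented in the complete scan.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("sampling-vs-full-scan").getOrCreate()

# Hypothetical curated transaction data already landed in the lake
txns = spark.read.parquet("hdfs:///lake/curated/transactions/")

# Sample-based estimate: fast, but rare channels and segments may disappear
estimate = (txns.sample(fraction=0.01, seed=42)
                .groupBy("channel")
                .agg(F.avg("amount").alias("avg_amount")))

# Full-set aggregation at a granular level: every segment is represented
exact = (txns.groupBy("channel", "customer_segment")
             .agg(F.avg("amount").alias("avg_amount"),
                  F.count("*").alias("n_txns")))
exact.show()
```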
Low Cost of Transformation: Optimizing Workloads
Extract, Transform and Load (ETL) workloads that had previously run on
expensive systems can now migrate to the enterprise Data Lake, where they run
at very low cost, in parallel, and much faster than before. Optimizing the
placement of these workloads frees capacity on high-end analytic and data
warehouse systems, making them more valuable by allowing them to
concentrate on the business-critical applications that they process.
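
As a sketch of what such an offloaded ETL job can look like, the PySpark fragment below reads raw extracts landed in HDFS, applies transformations in parallel across the cluster, and writes columnar output. It assumes a Spark runtime on the Data Lake; the paths and fields are illustrative, not taken from this paper.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("etl-offload").getOrCreate()

# Raw delimited extracts landed in the lake (illustrative path)
raw = spark.read.csv("hdfs:///lake/landing/orders/",
                     header=True, inferSchema=True)

# Transformations that previously ran on the expensive warehouse,
# now executed in parallel on commodity hardware
cleaned = (raw.dropDuplicates(["order_id"])
              .withColumn("order_ts", F.to_timestamp("order_ts"))
              .filter(F.col("amount") > 0))

# Columnar output, ready for downstream query engines
cleaned.write.mode("overwrite").parquet("hdfs:///lake/curated/orders/")
```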
Adjunct to the EDW
With this newfound wealth of data, we’re also experiencing a cultural shift toward
democratization of data. Leading organizations are now saying, “Here, have some.
Let’s let everybody have access and see what they can do with it.” This is due to the
growing recognition that the more an organization can harness information, the
greater the value it derives from deeper insights. For this reason, organizations
are removing blocks to innovation and transforming the way data contributes to
success.
Serving as an adjunct to the EDW, a Data Lake can:
• Work in tandem with the EDW and allow you to offload colder data to the
Data Lake.
• Allow you to work with unstructured data.
• Support a cultural shift towards democratized data access.
• Contain costs while continuing to do more with more data.
Sounds compelling, doesn’t it? But how do you know if you are ready for a Data
Lake?
Determining Readiness: Some Questions to Ask
Here are some of the critical drivers that indicate readiness:
• Are you working with a growing amount of unstructured data?
• Are your lines of business demanding even more unstructured data?
• Does your organization need a unified view of information?
• Do you need to be able to perform real-time analysis on source data?
• Is your organization moving toward a culture of democratized data access?
• Would your organization benefit from elasticity of scale?
Design Principles
If you are ready for a Data Lake, there are some key design principles that we
recommend following. Here are our top five:
• Discovery without limitations
• Low latency at any scale
• Movement from a reactive model to a predictive model
• Elasticity in infrastructure
• Affordability
One of the most significant reasons to build a Data Lake is to encourage
experimentation and to move from an intuition-based model to a more
comprehensive, empirical, data science driven model. In order to enable that kind
of experimentation and analytical finesse to thrive, you have to allow for discovery
without limitations. By that, we mean you have to be willing and able to give users
access to all that data. You should also be able to perform low latency queries of
data at any scale.
Now let’s talk about what it takes to build a Data Lake.
Not a Big Bang Approach:
The Four Stages to Building a Data Lake
From our experience, building a Data Lake doesn’t happen all at once; instead,
there are stages of maturity:
1. Handling and ingesting data at scale
2. Building analytical muscle
3. Leveraging the strengths of the EDW and the Data Lake
4. Adopting broadly
1. Handling and Ingesting Data at Scale
This first stage involves getting the architecture in place and learning to acquire and
transform data at scale. This is when your organization will need to determine the
new and existing data sources that it can leverage. These data sources are then
integrated, and the volume and variety of data is ingested at high velocity into Hadoop
storage. At this stage, the analytics may be rather simple, perhaps consisting of
basic transformations, but it’s an important step in discovering how to make
Hadoop work the way you want.
2. Building Analytical Muscle
The second stage focuses on improving the ability to transform and analyze data.
This is where you begin to really leverage the enterprise Data Lake. For example,
your organization can start building batch, mini-batch, and real-time applications
for enterprise usage, exploratory analytics, and predictive use cases. Various tools
and frameworks are used at this stage. The EDW and the Data Lake start working
together.
3. Leveraging the Strengths of the EDW and the Data Lake
This is when the orchestra really starts to play. Here, in the third stage, you will
want to get data and analytics into the hands of as many people in the
organization as possible. Democratization begins. This is also the stage where the
EDW and Hadoop-based Big Data lake truly co-exist, allowing the enterprise to
leverage the strengths of each architecture.
4. Adopting Broadly
The fourth level is the highest stage of Data Lake maturity. Enterprise
capabilities are added to the Data Lake. Broad adoption of unified Data Lake
architectures requires information governance, compliance, security, auditing,
metadata management and information lifecycle management capabilities. Not
addressing these issues may result in slow enterprise adoption and runs the risk
that the Data Lake eventually becomes a “data swamp.”
The Big Data Lake: Understanding the Essentials
Understanding the layers of the data warehouse is an essential step in the Big
Data journey. The following pages elucidate the components of a Big Data
warehouse and the methodology to set it up.
Components of Big Data Warehouse
While requirements and specific business needs may vary within each
organization, the following diagram lists the major components of a Big
Data warehouse.
Data Sources
An enterprise usually has the following sources of data:
• A relational database such as Oracle, DB2, PostgreSQL, SQL Server and
the like
• Multiple disparate, unstructured and semi-structured data sources which
may have data in formats such as flat files, XML, JSON or CSV
• Existing systems may further provide integration data in EDI or other B2B
exchange formats
• Machine data and network elements that generate huge volumes of data
Hadoop Distribution
Hadoop is the most popular choice for Big Data today and is available in open-
source Apache and commercial distribution packages. Hadoop includes a file
system called HDFS (Hadoop Distributed File System), which forms the key data
storage layer of the Big Data warehouse. Other options are also available, such as
GPFS (from IBM) and S3 (from the Amazon cloud).
Data Ingestion
It is imperative to set up reliable and scalable data ingestion mechanisms to
bring data in from data sources to the Hadoop file system.
• For connecting relational databases, the most popular options are Sqoop and
database-specific connectors
• For streaming data, Apache Kafka and Flume are quite popular
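
For the streaming path, here is a minimal producer-side sketch using the open-source kafka-python client; the broker address, topic name, and event fields are assumptions for illustration only.

```python
import json
from kafka import KafkaProducer  # open-source client: pip install kafka-python

# Broker address and topic are hypothetical
producer = KafkaProducer(
    bootstrap_servers="broker1.example.com:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"))

# Publish a machine-generated event into the ingestion pipeline
producer.send("machine-events", {"sensor_id": "s-42", "temp_c": 71.5})
producer.flush()  # block until the event is actually delivered
```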
Figure 1: Big Data warehouse reference architecture by Impetus (diagram). Data sources (relational data - PostgreSQL, Oracle, DB2, SQL Server; flat files/XML/JSON/CSV; existing systems; machine data/network elements) feed a data ingestion layer (Kafka/Flume, Sqoop/connectors, existing DI tools, REST/JDBC/SOAP/custom, streaming) into Big Data storage and NoSQL data stores. A data query layer (relational offload engine: Hive/Pig/Drill/Spark SQL) and access services (search, pipelines, cubes), together with virtualization (federation, delivery, polyglot mapper), management (provisioning, monitoring, performance optimization, security), and governance (data quality, lifecycle management, data classification, information policy), support business intelligence (machine data analysis, predictive and statistical analysis, data discovery, visualization and reporting).
• Organizations that need to leverage streaming data sources - setting up an entire
topology of streaming source, ingestion, in-flight transformation and data
persistence - would need to use one of the common CEP (Complex Event
Processing) or streaming engines, such as Apache Storm or StreamAnalytix (a
minimal consumer-side sketch follows this list).
• Organizations that need to leverage their existing Data Integration (DI)
connectors may need custom scripts to integrate using REST, SOAP or JDBC
components.
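
A full CEP engine is beyond a short example, but the sketch below shows the consumer side of such a topology in miniature - read from a stream, transform in flight, persist - using the open-source kafka-python client. A local file stands in for the HDFS sink, and all names are illustrative assumptions.

```python
import json
from kafka import KafkaConsumer  # pip install kafka-python

# Subscribe to the hypothetical topic produced in the earlier sketch
consumer = KafkaConsumer(
    "machine-events",
    bootstrap_servers="broker1.example.com:9092",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")))

for message in consumer:
    event = message.value
    # In-flight transformation: flag out-of-range readings
    event["alert"] = event.get("temp_c", 0) > 85.0
    # Persistence: a local file stands in for the HDFS landing zone
    with open("/data/landing/machine-events.jsonl", "a") as sink:
        sink.write(json.dumps(event) + "\n")
```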
Data Query
For data resident in HDFS, a multitude of query engines is available, such as Pig,
Hive, Apache Drill and Spark SQL.
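
For instance, Spark SQL can register files resident in HDFS as a table and query them with ordinary SQL. In the hedged sketch below, the path, view name, and columns are invented for illustration.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lake-query").getOrCreate()

# Register raw JSON events in HDFS as a queryable view (illustrative path)
events = spark.read.json("hdfs:///lake/raw/events/")
events.createOrReplaceTempView("events")

# Ordinary SQL over data that was never structured in advance
top_devices = spark.sql("""
    SELECT device_type, COUNT(*) AS hits
    FROM events
    GROUP BY device_type
    ORDER BY hits DESC
    LIMIT 10
""")
top_devices.show()
```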
Many organizations, however, would prefer to re-use their SQL scripts and
procedures written for their traditional enterprise data warehouse. Because they
have already invested millions of dollars in the traditional SQL and PL/SQL engines,
it is understandable that organizations want to explore mechanisms that will allow
them to offload the data tables from relational data warehouses to a Big Data
warehouse while keeping their querying/reporting scripts intact.
Tools and solutions are now available from organizations like Impetus
Technologies that help enterprises offload expensive computation from
relational data warehouses to Big Data warehouses without re-writing the entire
processing layer.
Data Stores
Along with HDFS, there is a trend to couple a data store or NoSQL database,
like HBase or Cassandra, with the Big Data warehouse. These stores provide
additional functions in the form of columnar storage, schema-less storage,
querying, OLAP/OLTP workloads, and application integration.
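
As a small illustration of the kind of schema-less, application-facing access such a store adds, the sketch below uses happybase, an open-source Python client for HBase’s Thrift gateway; the host, table, and column names are assumptions, not part of the reference architecture.

```python
import happybase  # open-source HBase Thrift client: pip install happybase

# Connect to a hypothetical HBase Thrift gateway
connection = happybase.Connection("hbase-thrift.example.com")
table = connection.table("clickstream")

# Schema-less write: column families are fixed, column names are free-form
table.put(b"user123#20150601", {b"event:page": b"/checkout",
                                b"event:latency_ms": b"842"})

# Low-latency point read for application integration
row = table.row(b"user123#20150601")
print(row)
```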
Access
With data stored in the HDFS or NoSQL layer, organizations have increasingly
complex access requirements. These include features from the traditional world,
like search and cube functions. There are also new tools that help manage
complex pipelines of jobs, where the output of one query is fed as input into
another.
Governance
Ensuring data quality is the key reason for data governance in the Big Data
warehouse.
• While the aim of the Big Data warehouse is to offer a Data Lake integrated
with all enterprise data sources, it is still essential to apply data quality
regulations to ensure the Data Lake does not turn into a data swamp
• Similarly, data users increasingly need to make sure that they are able to
manage the data through its entire lifecycle
• Classifying data based on various segments, like business user groups (for
instance, marketing, risk management, and operations), ensures control and
governance of the data
• It is also imperative to define enterprise-level information policies to avoid
breaches and ensure control over the entire data warehouse
Virtualization
Organizations have found that, despite their best intentions and use cases, they
may have to deal with the coexistence of the enterprise data warehouse and the
Big Data warehouse for a period of time. To ensure consistent results with
appropriate polyglot querying, data federation and delivery mechanisms are
essential.
Management
To provision and monitor the entire cluster, operations teams need handy
tools and dashboards for cluster management. It is not uncommon to find
engineers diagnosing the performance of MapReduce jobs and queries in their
quest for optimal speed and minimal resource consumption. Security is another
key aspect of the warehouse, with authentication and role-based authorization
behind defined gateways.
Business Intelligence
The goal of the warehouse is to achieve business insights and generate
intelligence for the organization. To achieve that objective, business teams
need to be empowered with various visualization and reporting tools. Data
scientists can also help discover data patterns using predictive/statistical
algorithms and machine data analytics.
Stages for Setting up a Big Data Warehouse
The journey to a Big Data warehouse is a multi-stage process. It requires selecting
the right tools, keeping a clear vision, and following a process to lay out an
effective and integrated data warehouse. The key stages for setting up a Big Data
warehouse broadly include the following:
Stage One: Handle and Ingest Data at Scale
As the first stage, the organization needs to determine the existing and new
data sources that it can leverage.
• The data sources are integrated and the variety of voluminous data is
ingested at high velocity into Hadoop storage
• The incoming data may come in varied formats, ranging from unstructured,
structured, and streaming data to machine, geo-spatial, and time-series data,
or external data sets like social media data
Figure 2: Handle and ingest data at scale (diagram). Streaming, unstructured, structured, machine, geospatial/time-series, and external social data flow through landing and ingestion into Big Data storage.
Stage Two: Build the Analytical Muscle
In order to leverage the enterprise Data Lake in Hadoop, the organization builds
batch, mini-batch and real-time applications for enterprise usage, exploratory
analytics and predictive use cases. Various tools and frameworks are utilized in
this stage as organizations begin to:
• Explore advanced querying engines, starting with MapReduce and moving on to
Apache Spark, Flink, etc., for interactive results
• Build use cases for both batch and real-time processing using streaming
solutions like Apache Storm and StreamAnalytix
• Build analytic applications for enterprise adoption, exploration, discovery and
prediction
Stage Three: Enterprise Data Warehouse and Big Data Warehouse Work in
Unison
In a real-world scenario, the enterprise data warehouse (EDW) and the Hadoop-based
Big Data warehouse (BDW) would co-exist as follows:
• The organization would leverage the data and specific capabilities of each to its
advantage
• Rather than disposing of the expensive enterprise warehouse, organizations
prefer to leverage it alongside Big Data technologies
• Once a stable and mature Big Data warehouse is achieved, the EDW and
BDW work in unison to achieve multi-workload distribution, offloading to each
other as required
• Specialized solutions, like the Impetus relational offload solution, help
organizations save millions of dollars with superior time and
schedule benefits
Figure 3: Build the analytical muscle (diagram). The landing-and-ingestion pipeline of Figure 2 now feeds Big Data storage plus real-time, enterprise, exploration and discovery, and predictive applications, with provisioning, workflow, monitoring and security spanning the stack.
Stage Four: Achieving Enterprise Maturity in the Warehouse
For a unified data warehouse, various enterprise ready capabilities are needed.
These are particularly pertinent in the case of information governance, metadata
management and information lifecycle management.
While organizations may begin with basic governance paradigms, as they mature in
the journey it becomes essential to have more sophisticated practices and policies.
Further, the appetite of the user is no longer satisfied by simply exploring data and
managing it through its lifespan. Instead, organizations need tools and utilities to
handle the laborious tasks of discovering data, deriving insights and managing the
information lifecycle.
Figure 4: Enterprise data warehouse and Big Data warehouse work in unison (diagram). The stack of Figure 3 now exchanges workloads with traditional data repositories (RDBMS, MPP).
Figure 5: Achieving enterprise maturity in the warehouse (diagram). The unified stack of Figure 4 gains a layer for governance, information lifecycle, and enterprise metadata management.
Summary
• Success in the business world today depends on high-quality, accessible
information that offers actionable insight
• The emergence of the Data Lake is inspired by the need to manage and
exploit new types of data
• Organizations need new architectures capable of integrating both structured and
unstructured data, managing massive data sets, and delivering
real-time analytics
• Hadoop-based Big Data architectures have changed the face of the data
warehouse, business intelligence and analytics world forever
• There is a growing acceptance of the concept of a Data Lake as a
cornerstone component of an enterprise Big Data strategy
• Enterprise adoption of Big Data architectures is accelerating as a way to
enable broad new opportunities across all functions and industries
• Big Data warehouse architectures will complement rather than replace the
enterprise data warehouses of today
• As the enterprise data warehouse and the Data Lake work together in unison,
they provide a synergy of capabilities, ultimately allowing analysts to do more
with data and derive business results faster
Conclusion
Each new technological leap brings with it a buzz of excitement and uncertainty.
Big Data technologies, and the Hadoop ecosystem in particular, seem to have
captured imaginations across the IT landscape. However, the journey to a Big
Data warehouse requires adequate planning and vision, combined with robust
engineering and technology practices.
Smart, conscientious Data Lake development can drive greater value to and
from a company’s data, while tapping the incredible power of innovation to drive
real insight.
Impetus Technologies specializes in this niche area, has unraveled the
mysteries surrounding the Big Data warehouse for many customers, and continues
to stay on top of the continuously evolving ecosystem. With products, solutions and
services expertise available at every required stage of the Big Data journey,
enterprise adopters can collaborate with niche providers to weigh the options
and chart a visionary path in their industry segment. Big Data provides an
unparalleled opportunity to turn information into a competitive asset, yielding
revenues for business and glory for IT like never before.
© 2015 Impetus Technologies, Inc. All rights reserved. Product and company names mentioned herein may be trademarks of their respective companies.
June 2015
About Impetus
Impetus is focused on creating big business impact through Big Data Solutions for Fortune
1000 enterprises across multiple verticals. The company brings together a unique mix of
software products, consulting services, Data Science capabilities and technology expertise.
It offers full life-cycle services for Big Data implementations and real-time streaming
analytics, including technology strategy, solution architecture, proof of concept, production
implementation and on-going support to its clients.
Visit http://impetus.com or write to us at bigdata@impetus.com
About Impetus

More Related Content

What's hot

Data Lake,beyond the Data Warehouse
Data Lake,beyond the Data WarehouseData Lake,beyond the Data Warehouse
Data Lake,beyond the Data Warehouse
Data Science Thailand
 
Making Big Data Easy for Everyone
Making Big Data Easy for EveryoneMaking Big Data Easy for Everyone
Making Big Data Easy for Everyone
Caserta
 
Security issues in big data
Security issues in big data Security issues in big data
Security issues in big data
Shallote Dsouza
 
Intro to Data Science on Hadoop
Intro to Data Science on HadoopIntro to Data Science on Hadoop
Intro to Data Science on Hadoop
Caserta
 
Building a Big Data Analytics Platform- Impetus White Paper
Building a Big Data Analytics Platform- Impetus White PaperBuilding a Big Data Analytics Platform- Impetus White Paper
Building a Big Data Analytics Platform- Impetus White Paper
Impetus Technologies
 
Making ‘Big Data’ Your Ally – Using data analytics to improve compliance, due...
Making ‘Big Data’ Your Ally – Using data analytics to improve compliance, due...Making ‘Big Data’ Your Ally – Using data analytics to improve compliance, due...
Making ‘Big Data’ Your Ally – Using data analytics to improve compliance, due...emermell
 
Data Lakes versus Data Warehouses
Data Lakes versus Data WarehousesData Lakes versus Data Warehouses
Data Lakes versus Data Warehouses
Tom Donoghue
 
Data Modeling for Big Data
Data Modeling for Big DataData Modeling for Big Data
Data Modeling for Big Data
DATAVERSITY
 
Better Architecture for Data: Adaptable, Scalable, and Smart
Better Architecture for Data: Adaptable, Scalable, and SmartBetter Architecture for Data: Adaptable, Scalable, and Smart
Better Architecture for Data: Adaptable, Scalable, and Smart
Paul Boal
 
Big data and Hadoop overview
Big data and Hadoop overviewBig data and Hadoop overview
Big data and Hadoop overview
Nitesh Ghosh
 
Hadoop(Term Paper)
Hadoop(Term Paper)Hadoop(Term Paper)
Hadoop(Term Paper)
Dux Chandegra
 
Hadoop Data Modeling
Hadoop Data ModelingHadoop Data Modeling
Hadoop Data Modeling
Adam Doyle
 
Big Data's Impact on the Enterprise
Big Data's Impact on the EnterpriseBig Data's Impact on the Enterprise
Big Data's Impact on the Enterprise
Caserta
 
Big dataplatform operationalstrategy
Big dataplatform operationalstrategyBig dataplatform operationalstrategy
Big dataplatform operationalstrategy
Himanshu Bari
 
Big data analysis concepts and references
Big data analysis concepts and referencesBig data analysis concepts and references
Big data analysis concepts and references
Information Security Awareness Group
 
How to Quickly and Easily Draw Value from Big Data Sources_Q3 symposia(Moa)
How to Quickly and Easily Draw Value  from Big Data Sources_Q3 symposia(Moa)How to Quickly and Easily Draw Value  from Big Data Sources_Q3 symposia(Moa)
How to Quickly and Easily Draw Value from Big Data Sources_Q3 symposia(Moa)Moacyr Passador
 
Assumptions about Data and Analysis: Briefing room webcast slides
Assumptions about Data and Analysis: Briefing room webcast slidesAssumptions about Data and Analysis: Briefing room webcast slides
Assumptions about Data and Analysis: Briefing room webcast slides
mark madsen
 
Business intelligence and data warehousing
Business intelligence and data warehousingBusiness intelligence and data warehousing
Business intelligence and data warehousing
OZ Assignment help
 
DOAG Big Data Days 2017 - Cloud Journey
DOAG Big Data Days 2017 - Cloud JourneyDOAG Big Data Days 2017 - Cloud Journey
DOAG Big Data Days 2017 - Cloud Journey
Harald Erb
 
Moving Past Infrastructure Limitations
Moving Past Infrastructure LimitationsMoving Past Infrastructure Limitations
Moving Past Infrastructure Limitations
Caserta
 

What's hot (20)

Data Lake,beyond the Data Warehouse
Data Lake,beyond the Data WarehouseData Lake,beyond the Data Warehouse
Data Lake,beyond the Data Warehouse
 
Making Big Data Easy for Everyone
Making Big Data Easy for EveryoneMaking Big Data Easy for Everyone
Making Big Data Easy for Everyone
 
Security issues in big data
Security issues in big data Security issues in big data
Security issues in big data
 
Intro to Data Science on Hadoop
Intro to Data Science on HadoopIntro to Data Science on Hadoop
Intro to Data Science on Hadoop
 
Building a Big Data Analytics Platform- Impetus White Paper
Building a Big Data Analytics Platform- Impetus White PaperBuilding a Big Data Analytics Platform- Impetus White Paper
Building a Big Data Analytics Platform- Impetus White Paper
 
Making ‘Big Data’ Your Ally – Using data analytics to improve compliance, due...
Making ‘Big Data’ Your Ally – Using data analytics to improve compliance, due...Making ‘Big Data’ Your Ally – Using data analytics to improve compliance, due...
Making ‘Big Data’ Your Ally – Using data analytics to improve compliance, due...
 
Data Lakes versus Data Warehouses
Data Lakes versus Data WarehousesData Lakes versus Data Warehouses
Data Lakes versus Data Warehouses
 
Data Modeling for Big Data
Data Modeling for Big DataData Modeling for Big Data
Data Modeling for Big Data
 
Better Architecture for Data: Adaptable, Scalable, and Smart
Better Architecture for Data: Adaptable, Scalable, and SmartBetter Architecture for Data: Adaptable, Scalable, and Smart
Better Architecture for Data: Adaptable, Scalable, and Smart
 
Big data and Hadoop overview
Big data and Hadoop overviewBig data and Hadoop overview
Big data and Hadoop overview
 
Hadoop(Term Paper)
Hadoop(Term Paper)Hadoop(Term Paper)
Hadoop(Term Paper)
 
Hadoop Data Modeling
Hadoop Data ModelingHadoop Data Modeling
Hadoop Data Modeling
 
Big Data's Impact on the Enterprise
Big Data's Impact on the EnterpriseBig Data's Impact on the Enterprise
Big Data's Impact on the Enterprise
 
Big dataplatform operationalstrategy
Big dataplatform operationalstrategyBig dataplatform operationalstrategy
Big dataplatform operationalstrategy
 
Big data analysis concepts and references
Big data analysis concepts and referencesBig data analysis concepts and references
Big data analysis concepts and references
 
How to Quickly and Easily Draw Value from Big Data Sources_Q3 symposia(Moa)
How to Quickly and Easily Draw Value  from Big Data Sources_Q3 symposia(Moa)How to Quickly and Easily Draw Value  from Big Data Sources_Q3 symposia(Moa)
How to Quickly and Easily Draw Value from Big Data Sources_Q3 symposia(Moa)
 
Assumptions about Data and Analysis: Briefing room webcast slides
Assumptions about Data and Analysis: Briefing room webcast slidesAssumptions about Data and Analysis: Briefing room webcast slides
Assumptions about Data and Analysis: Briefing room webcast slides
 
Business intelligence and data warehousing
Business intelligence and data warehousingBusiness intelligence and data warehousing
Business intelligence and data warehousing
 
DOAG Big Data Days 2017 - Cloud Journey
DOAG Big Data Days 2017 - Cloud JourneyDOAG Big Data Days 2017 - Cloud Journey
DOAG Big Data Days 2017 - Cloud Journey
 
Moving Past Infrastructure Limitations
Moving Past Infrastructure LimitationsMoving Past Infrastructure Limitations
Moving Past Infrastructure Limitations
 

Viewers also liked

Architecting a Scalable Hadoop Platform: Top 10 considerations for success
Architecting a Scalable Hadoop Platform: Top 10 considerations for successArchitecting a Scalable Hadoop Platform: Top 10 considerations for success
Architecting a Scalable Hadoop Platform: Top 10 considerations for success
DataWorks Summit
 
Designing Fast Data Architecture for Big Data using Logical Data Warehouse a...
Designing Fast Data Architecture for Big Data  using Logical Data Warehouse a...Designing Fast Data Architecture for Big Data  using Logical Data Warehouse a...
Designing Fast Data Architecture for Big Data using Logical Data Warehouse a...
Denodo
 
Building Reactive Fast Data & the Data Lake with Akka, Kafka, Spark
Building Reactive Fast Data & the Data Lake with Akka, Kafka, SparkBuilding Reactive Fast Data & the Data Lake with Akka, Kafka, Spark
Building Reactive Fast Data & the Data Lake with Akka, Kafka, Spark
Todd Fritz
 
Building the Data Lake with Azure Data Factory and Data Lake Analytics
Building the Data Lake with Azure Data Factory and Data Lake AnalyticsBuilding the Data Lake with Azure Data Factory and Data Lake Analytics
Building the Data Lake with Azure Data Factory and Data Lake Analytics
Khalid Salama
 
Reactive Fast Data & the Data Lake with Akka, Kafka, Spark
Reactive Fast Data & the Data Lake with Akka, Kafka, SparkReactive Fast Data & the Data Lake with Akka, Kafka, Spark
Reactive Fast Data & the Data Lake with Akka, Kafka, Spark
Todd Fritz
 
Best Practices for Building a Data Lake with Amazon S3 - August 2016 Monthly ...
Best Practices for Building a Data Lake with Amazon S3 - August 2016 Monthly ...Best Practices for Building a Data Lake with Amazon S3 - August 2016 Monthly ...
Best Practices for Building a Data Lake with Amazon S3 - August 2016 Monthly ...
Amazon Web Services
 
(BDT317) Building A Data Lake On AWS
(BDT317) Building A Data Lake On AWS(BDT317) Building A Data Lake On AWS
(BDT317) Building A Data Lake On AWS
Amazon Web Services
 
Implementing a Data Lake with Enterprise Grade Data Governance
Implementing a Data Lake with Enterprise Grade Data GovernanceImplementing a Data Lake with Enterprise Grade Data Governance
Implementing a Data Lake with Enterprise Grade Data Governance
Hortonworks
 
Building a Server-less Data Lake on AWS - Technical 301
Building a Server-less Data Lake on AWS - Technical 301Building a Server-less Data Lake on AWS - Technical 301
Building a Server-less Data Lake on AWS - Technical 301
Amazon Web Services
 
Big data architectures and the data lake
Big data architectures and the data lakeBig data architectures and the data lake
Big data architectures and the data lake
James Serra
 
(BDT310) Big Data Architectural Patterns and Best Practices on AWS
(BDT310) Big Data Architectural Patterns and Best Practices on AWS(BDT310) Big Data Architectural Patterns and Best Practices on AWS
(BDT310) Big Data Architectural Patterns and Best Practices on AWS
Amazon Web Services
 

Viewers also liked (11)

Architecting a Scalable Hadoop Platform: Top 10 considerations for success
Architecting a Scalable Hadoop Platform: Top 10 considerations for successArchitecting a Scalable Hadoop Platform: Top 10 considerations for success
Architecting a Scalable Hadoop Platform: Top 10 considerations for success
 
Designing Fast Data Architecture for Big Data using Logical Data Warehouse a...
Designing Fast Data Architecture for Big Data  using Logical Data Warehouse a...Designing Fast Data Architecture for Big Data  using Logical Data Warehouse a...
Designing Fast Data Architecture for Big Data using Logical Data Warehouse a...
 
Building Reactive Fast Data & the Data Lake with Akka, Kafka, Spark
Building Reactive Fast Data & the Data Lake with Akka, Kafka, SparkBuilding Reactive Fast Data & the Data Lake with Akka, Kafka, Spark
Building Reactive Fast Data & the Data Lake with Akka, Kafka, Spark
 
Building the Data Lake with Azure Data Factory and Data Lake Analytics
Building the Data Lake with Azure Data Factory and Data Lake AnalyticsBuilding the Data Lake with Azure Data Factory and Data Lake Analytics
Building the Data Lake with Azure Data Factory and Data Lake Analytics
 
Reactive Fast Data & the Data Lake with Akka, Kafka, Spark
Reactive Fast Data & the Data Lake with Akka, Kafka, SparkReactive Fast Data & the Data Lake with Akka, Kafka, Spark
Reactive Fast Data & the Data Lake with Akka, Kafka, Spark
 
Best Practices for Building a Data Lake with Amazon S3 - August 2016 Monthly ...
Best Practices for Building a Data Lake with Amazon S3 - August 2016 Monthly ...Best Practices for Building a Data Lake with Amazon S3 - August 2016 Monthly ...
Best Practices for Building a Data Lake with Amazon S3 - August 2016 Monthly ...
 
(BDT317) Building A Data Lake On AWS
(BDT317) Building A Data Lake On AWS(BDT317) Building A Data Lake On AWS
(BDT317) Building A Data Lake On AWS
 
Implementing a Data Lake with Enterprise Grade Data Governance
Implementing a Data Lake with Enterprise Grade Data GovernanceImplementing a Data Lake with Enterprise Grade Data Governance
Implementing a Data Lake with Enterprise Grade Data Governance
 
Building a Server-less Data Lake on AWS - Technical 301
Building a Server-less Data Lake on AWS - Technical 301Building a Server-less Data Lake on AWS - Technical 301
Building a Server-less Data Lake on AWS - Technical 301
 
Big data architectures and the data lake
Big data architectures and the data lakeBig data architectures and the data lake
Big data architectures and the data lake
 
(BDT310) Big Data Architectural Patterns and Best Practices on AWS
(BDT310) Big Data Architectural Patterns and Best Practices on AWS(BDT310) Big Data Architectural Patterns and Best Practices on AWS
(BDT310) Big Data Architectural Patterns and Best Practices on AWS
 

Similar to Whitepaper-The-Data-Lake-3_0

Gerenral insurance Accounts IT and Investment
Gerenral insurance Accounts IT and InvestmentGerenral insurance Accounts IT and Investment
Gerenral insurance Accounts IT and Investment
vijayk23x
 
Enterprise Data Lake
Enterprise Data LakeEnterprise Data Lake
Enterprise Data Lake
sambiswal
 
Using Data Lakes to Sail Through Your Sales Goals
Using Data Lakes to Sail Through Your Sales GoalsUsing Data Lakes to Sail Through Your Sales Goals
Using Data Lakes to Sail Through Your Sales Goals
IrshadKhan682442
 
Using Data Lakes to Sail Through Your Sales Goals
Using Data Lakes to Sail Through Your Sales GoalsUsing Data Lakes to Sail Through Your Sales Goals
Using Data Lakes to Sail Through Your Sales Goals
WilliamJohnson288536
 
Using Data Lakes To Sail Through Your Sales Goals
Using Data Lakes To Sail Through Your Sales GoalsUsing Data Lakes To Sail Through Your Sales Goals
Using Data Lakes To Sail Through Your Sales Goals
KevinJohnson667312
 
Top 60+ Data Warehouse Interview Questions and Answers.pdf
Top 60+ Data Warehouse Interview Questions and Answers.pdfTop 60+ Data Warehouse Interview Questions and Answers.pdf
Top 60+ Data Warehouse Interview Questions and Answers.pdf
Datacademy.ai
 
Oracle sql plsql & dw
Oracle sql plsql & dwOracle sql plsql & dw
Oracle sql plsql & dw
Sateesh Kumar Sarvasiddi
 
Data Science.pdf
Data Science.pdfData Science.pdf
Data Science.pdf
Umar khan
 
the process of transforming data into in
the process of transforming data into inthe process of transforming data into in
the process of transforming data into in
NISHANTHM64
 
Modern Integrated Data Environment - Whitepaper | Qubole
Modern Integrated Data Environment - Whitepaper | QuboleModern Integrated Data Environment - Whitepaper | Qubole
Modern Integrated Data Environment - Whitepaper | Qubole
Vasu S
 
data warehouse vs data lake
data warehouse vs data lakedata warehouse vs data lake
data warehouse vs data lake
Polestarsolutions
 
Optimising Data Lakes for Financial Services
Optimising Data Lakes for Financial ServicesOptimising Data Lakes for Financial Services
Optimising Data Lakes for Financial Services
Andrew Carr
 
Big data data lake and beyond
Big data data lake and beyond Big data data lake and beyond
Big data data lake and beyond
Rajesh Kumar
 
Beyond the Basics - Evolving Trends in Data Storage Strategies.pdf
Beyond the Basics - Evolving Trends in Data Storage Strategies.pdfBeyond the Basics - Evolving Trends in Data Storage Strategies.pdf
Beyond the Basics - Evolving Trends in Data Storage Strategies.pdf
kelyn Technology
 
Difference between Database vs Data Warehouse vs Data Lake
Difference between Database vs Data Warehouse vs Data LakeDifference between Database vs Data Warehouse vs Data Lake
Difference between Database vs Data Warehouse vs Data Lake
jeetendra mandal
 
final-the-data-teams-guide-to-the-db-lakehouse-platform-rd-6-14-22.pdf
final-the-data-teams-guide-to-the-db-lakehouse-platform-rd-6-14-22.pdffinal-the-data-teams-guide-to-the-db-lakehouse-platform-rd-6-14-22.pdf
final-the-data-teams-guide-to-the-db-lakehouse-platform-rd-6-14-22.pdf
XIAOZEJIN1
 
Big Data and BI Tools - BI Reporting for Bay Area Startups User Group
Big Data and BI Tools - BI Reporting for Bay Area Startups User GroupBig Data and BI Tools - BI Reporting for Bay Area Startups User Group
Big Data and BI Tools - BI Reporting for Bay Area Startups User Group
Scott Mitchell
 
Big data analytics - Introduction to Big Data and Hadoop
Big data analytics - Introduction to Big Data and HadoopBig data analytics - Introduction to Big Data and Hadoop
Big data analytics - Introduction to Big Data and Hadoop
SamiraChandan
 
How 3 trends are shaping analytics and data management
How 3 trends are shaping analytics and data management How 3 trends are shaping analytics and data management
How 3 trends are shaping analytics and data management
Abhishek Sood
 

Similar to Whitepaper-The-Data-Lake-3_0 (20)

Gerenral insurance Accounts IT and Investment
Gerenral insurance Accounts IT and InvestmentGerenral insurance Accounts IT and Investment
Gerenral insurance Accounts IT and Investment
 
Big data rmoug
Big data rmougBig data rmoug
Big data rmoug
 
Enterprise Data Lake
Enterprise Data LakeEnterprise Data Lake
Enterprise Data Lake
 
Using Data Lakes to Sail Through Your Sales Goals
Using Data Lakes to Sail Through Your Sales GoalsUsing Data Lakes to Sail Through Your Sales Goals
Using Data Lakes to Sail Through Your Sales Goals
 
Using Data Lakes to Sail Through Your Sales Goals
Using Data Lakes to Sail Through Your Sales GoalsUsing Data Lakes to Sail Through Your Sales Goals
Using Data Lakes to Sail Through Your Sales Goals
 
Using Data Lakes To Sail Through Your Sales Goals
Using Data Lakes To Sail Through Your Sales GoalsUsing Data Lakes To Sail Through Your Sales Goals
Using Data Lakes To Sail Through Your Sales Goals
 
Top 60+ Data Warehouse Interview Questions and Answers.pdf
Top 60+ Data Warehouse Interview Questions and Answers.pdfTop 60+ Data Warehouse Interview Questions and Answers.pdf
Top 60+ Data Warehouse Interview Questions and Answers.pdf
 
Oracle sql plsql & dw
Oracle sql plsql & dwOracle sql plsql & dw
Oracle sql plsql & dw
 
Data Science.pdf
Data Science.pdfData Science.pdf
Data Science.pdf
 
the process of transforming data into in
the process of transforming data into inthe process of transforming data into in
the process of transforming data into in
 
Modern Integrated Data Environment - Whitepaper | Qubole
Modern Integrated Data Environment - Whitepaper | QuboleModern Integrated Data Environment - Whitepaper | Qubole
Modern Integrated Data Environment - Whitepaper | Qubole
 
data warehouse vs data lake
data warehouse vs data lakedata warehouse vs data lake
data warehouse vs data lake
 
Optimising Data Lakes for Financial Services
Optimising Data Lakes for Financial ServicesOptimising Data Lakes for Financial Services
Optimising Data Lakes for Financial Services
 
Big data data lake and beyond
Big data data lake and beyond Big data data lake and beyond
Big data data lake and beyond
 
Beyond the Basics - Evolving Trends in Data Storage Strategies.pdf
Beyond the Basics - Evolving Trends in Data Storage Strategies.pdfBeyond the Basics - Evolving Trends in Data Storage Strategies.pdf
Beyond the Basics - Evolving Trends in Data Storage Strategies.pdf
 
Difference between Database vs Data Warehouse vs Data Lake
Difference between Database vs Data Warehouse vs Data LakeDifference between Database vs Data Warehouse vs Data Lake
Difference between Database vs Data Warehouse vs Data Lake
 
final-the-data-teams-guide-to-the-db-lakehouse-platform-rd-6-14-22.pdf
final-the-data-teams-guide-to-the-db-lakehouse-platform-rd-6-14-22.pdffinal-the-data-teams-guide-to-the-db-lakehouse-platform-rd-6-14-22.pdf
final-the-data-teams-guide-to-the-db-lakehouse-platform-rd-6-14-22.pdf
 
Big Data and BI Tools - BI Reporting for Bay Area Startups User Group
Big Data and BI Tools - BI Reporting for Bay Area Startups User GroupBig Data and BI Tools - BI Reporting for Bay Area Startups User Group
Big Data and BI Tools - BI Reporting for Bay Area Startups User Group
 
Big data analytics - Introduction to Big Data and Hadoop
Big data analytics - Introduction to Big Data and HadoopBig data analytics - Introduction to Big Data and Hadoop
Big data analytics - Introduction to Big Data and Hadoop
 
How 3 trends are shaping analytics and data management
How 3 trends are shaping analytics and data management How 3 trends are shaping analytics and data management
How 3 trends are shaping analytics and data management
 

Whitepaper-The-Data-Lake-3_0

  • 1. Implementing the Enterprise Data Lake A four-stage approach to building a massive, easily accessible, flexible and scalable Big Data repository What is a Data Lake, and how does it help meet the challenges of Big Data? Will the Enterprise Data Warehouse (EDW) and the Data Lake coexist? If so, how? This paper explores what it takes to get started on the journey toward incorporating a Data Lake into an organization’s architecture. www.impetus.com
  • 2. Data is like money. The world runs on it. We believe it’s valuable, and we are pretty sure we can’t have too much of it. We save it, store it, move it around in various formats, and use it for all kinds of purposes. We will also take it in just about any form we can get it. We might not know how we’re going to use it, but we’re willing to collect it now and figure out what to do with it later. For the most part, we’re happy as long as it just keeps on coming in. At least, that’s been true when we thought of data as finite. But, with the advent of Big Data, it’s pouring in like never before - and while we still want it in any form we can get it, structured or unstructured, the issues of storing, managing, and analyzing it are becoming more complex. Interestingly, despite all the advances of technology, the money that data most closely resembles isn’t the conceptual kind that we refer to when we say it’s “on paper” or that is being traded in nano-seconds in financial markets, but rather, data is more like cold hard cash. It’s essentially a physical thing - heavy, a bit cumbersome and hard to move. Large data transfers can take days. You don’t want to move it very frequently, and you don’t want to move it very far. And if you do have to move it, you’d prefer to transport it as safely as possibly, maybe via armored truck, for example. So what to do with it all? How do we store it? How do we manage it? How do we use it? These are the questions that lead us to the Data Lake. But what is a Data Lake, and how does it help meet the challenges of Big Data? Introduction 2 Defining the Problem Before we talk about what a Data Lake1 is, let’s define the problem and a term or two a little more clearly. First there is unstructured data. While organizations are amassing massive amounts of data, much of it is unstructured. Unstructured data refers to information that either does not have a pre-defined data model or is not organized in a pre-defined manner. And that’s a concern because if it’s not pre- defined or structured, it’s usually difficult to analyze. And if you can’t analyze it, what’s the point? Additionally, structuring it is a laborious, time-consuming task. However, unstructured data accounts for much of the explosion that is Big Data. It is also widely understood as holding the most promise for gaining new, actionable insights. Nearly all the data that lives outside of databases is unstructured, including images, videos and log files produced by computers, machines and sensors. Even this document is unstructured data. The sheer volume of it is staggering; unstructured data makes up at least 80 percent of all digitally stored data. And as the data-driven economy grows, the amount of unstructured data only grows. So, what to do? Enter the Data Lake. The power of Big Data is the ability to correlate data. Real-time enterprise data analytics are really all about improving decision making. 1 “Data Lake” is one of several interchangeable common terms that could have been used here. Some others are Big Data repository, unified data architecture, and modern data architecture.
  • 3. 3 What is a Data Lake? A “Data Lake” is one of several interchangeable terms that are commonly used. Some others are Big Data repository, unified data architecture, modern data architecture, as well as others. No matter what it is called, the concept is the same: take the data that’s coming in -- in unstructured torrents -- and store it where it’s more accessible, flexible and scalable and able to be analyzed without the need to structure it. Here are two common definitions: • A Data Lake is a massive, easily accessible, flexible and scalable data repository • A Data Lake is an enterprise-wide data management platform for analyzing disparate sources of data in their native format Data Lakes include structured, semi-structured, and unstructured data. They are built on inexpensive computer hardware and are designed for storing uncategorized pools of data, including the following: • Data immediately of interest • Data potentially of interest • Data for which the intended usage is not yet known The information in the Data Lake is consolidated - both structured and unstructured - in a manner that allows inquiries to be made across the entire body of data all at once. This ability to access all of the data is especially appealing because the true power of Big Data is the ability to correlate insights across previously siloed data warehouses or between structured and unstructured data sources. Drivers for the Data Lake Real-time enterprise data analytics are all about improving decision making. With so much data traditionally siloed into different data warehouses, such as Enterprise Resource Planning (ERP), Customer Relationship Management (CRM), Human Resource Management (HCM), and others, it’s almost impossible to make correlations across these somewhat captive data sources. The thinking now is to integrate data silos, build infrastructures that empower data science to improve analytics, and reduce time to market by faster analytical processing. These are some of the drivers behind the new architecture that is a Data Lake. Limitations of the Current Enterprise Data Warehouse? Why can’t we just use what we’ve always used? For the last several decades, EDWs have served as the foundation for business intelligence and data discovery. The world of data warehousing was a more predictable world, a world where structures and formatting could take place in advance, where hypotheses were drawn, where the content of data was known, and where the scope was restricted and pre-defined. Thus, the metaphor of a warehouse worked well because, like a warehouse, one could organize data the way one might stock shelves.
  • 4. 4 Limitations of the Current Enterprise Data Warehouse? Why can’t we just use what we’ve always used? For the last several decades, EDWs have served as the foundation for business intelligence and data discovery. The world of data warehousing was a more predictable world, a world where structures and formatting could take place in advance, where hypotheses were drawn, where the content of data was known, and where the scope was restricted and pre-defined. Thus, the metaphor of a warehouse worked well because, like a warehouse, one could organize data the way one might stock shelves. In the real world of actual shelf-stocking, there are obviously some significant constraints related to physical space and cost. For example, if you were running a logistics company or a retail company, you’d need a physical structure as well as shelves and floor space to store all your palettes and boxes. You’d need a plan for what and where to store your inventory as well as labels for everything so that you could efficiently organize the space for shipping and managing your goods. In the world of data storage, the constraints are the same as physical inventory: it costs to store and it costs to move. However, part of the complexity in the realm of Big Data is that we no longer know what’s in the metaphorical boxes, let alone how we’re going to use it. EDWs are not only costly but they are not structured to handle the complexities of Big Data. Warehouses work when you can define what’s in the boxes and all the associated logistics. With Big Data, that’s no longer possible. Thus, what makes the EDW great is also what restricts it. Data warehouses store data in specific static structures and categories that dictate the kind of analysis that is possible on that data. With the emergence of Big Data, this approach falls short because it’s impossible to determine what the data might hold. And in cases where analysis is required in real-time, formatting in advance is not an option. The point here is that the world of data has become fluid, not static. And data is available in such massive volumes, at near real-time velocity and in its many unstructured forms. Real data discovery requires that analysts are able to ask questions of the data as train-of-thought demands. The real questions only emerge during the process of the analysis itself which is not easily done in the EDW world. What’s needed is an approach that allows business users to siphon off or distill the information they need as they need it. This is the shift that underpins the business Data Lake and which changes the game to something that better meets the needs of today’s responsive, real-time enterprise. Capabilities of the Data Lake What capabilities does the Data Lake bring to the enterprise? What are the capabilities that didn’t exist prior to the Data Lake? Here’s our list of the top four:
Active Archive: Providing Access to Historic Data

An active archive provides a single place to store all your data, in any format, at any volume, indefinitely. Enterprise data governance policies - and in many cases, federal law - dictate how data must be managed, including how long it must be retained. An active archive allows you to address these compliance requirements and deliver data on demand to satisfy internal and external regulatory demands. Because it is secure, you control who sees what; because it delivers governance and lineage services, you can trace the access and evolution of your data over time. Having access to historic data - both raw source information and data archived from conventional relational stores - is extremely valuable in use cases where data must be delivered on demand, such as health records that must be kept for a certain period or financial records that must be retained for regulatory compliance. This capability also enables immediate insight instead of long, drawn-out retrieval processes.

Self-Service Exploratory Business Intelligence

In many ways, stored data can be the best currency an organization has to offer. But like all other investments, this comes at a price: organizations must dedicate money and time to protecting their data. Users frequently want access to enterprise data for reporting, exploration, and analysis, but production enterprise data warehouse systems often need to be shielded from casual use so they can run the mission-critical financial and operational workloads they support. An enterprise Data Lake allows users to explore data, with full security, using traditional interactive business intelligence tools via SQL and keyword search.

Advanced Analytics: Far Beyond Data Sampling

A secret of many data analysis projects is that calculations are based on representative samples of the data rather than full sets. While this works nicely if you're trying to determine whether an Oscar-nominated film is likely to win an Academy Award based on its popularity compared to other nominees, what if you are a researcher at the Centers for Disease Control trying to determine the cause of an outbreak, an investment banker trying to measure risk, or a retailer wanting to understand customer motivations across channels? The bottom line is that you are much better off with the ability to search and analyze data at large scale and at a granular level, rather than just sampling it. Data Lakes provide that level of fidelity.

Low Cost of Transformation: Optimizing Workloads

Extract, Transform and Load (ETL) workloads that previously ran on expensive systems can migrate to the enterprise Data Lake, where they run at very low cost, in parallel, and much faster than before. Optimizing the placement of these workloads frees capacity on high-end analytic and data warehouse systems, making them more valuable by allowing them to concentrate on the business-critical applications they support. A minimal sketch of such an offloaded ETL job appears below.
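To make the workload-offload idea concrete, here is a minimal PySpark sketch of the kind of ETL job that might migrate from an expensive relational system to the Data Lake. The file paths, column names and schema are hypothetical illustrations, not a reference to any particular deployment.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Hypothetical example: a nightly ETL job offloaded from an expensive
# relational system to the Data Lake. Paths and columns are placeholders.
spark = SparkSession.builder.appName("etl-offload-sketch").getOrCreate()

# Read raw, unmodeled source files landed in HDFS (schema-on-read).
orders = spark.read.json("hdfs:///lake/raw/orders/2015-06-01/")

# The same cleanse-and-aggregate logic that once ran on the warehouse,
# expressed as parallel DataFrame transformations.
daily_revenue = (
    orders
    .filter(F.col("status") == "COMPLETED")
    .withColumn("order_date", F.to_date(F.col("order_ts")))
    .groupBy("order_date", "region")
    .agg(F.sum("amount").alias("revenue"),
         F.countDistinct("customer_id").alias("customers"))
)

# Write the conformed result back to the lake in a columnar format,
# partitioned for cheap downstream queries.
daily_revenue.write.mode("overwrite") \
    .partitionBy("order_date") \
    .parquet("hdfs:///lake/curated/daily_revenue/")

spark.stop()
```

Jobs like this run in parallel across the cluster, which is what frees capacity on the high-end warehouse systems described above.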
Adjunct to the EDW

With this newfound wealth of data, we're also experiencing a cultural shift toward the democratization of data. Leading organizations are now saying, "Here, have some. Let's let everybody have access and see what they can do with it." This reflects a growing recognition that the more an organization can harness its information, the greater the value it derives from deeper insights. For this reason, organizations are removing blocks to innovation and transforming the way data contributes to success. Serving as an adjunct to the EDW, a Data Lake can:

• Work in tandem with the EDW, allowing you to offload colder data to the Data Lake
• Allow you to work with unstructured data
• Support a cultural shift toward democratized data access
• Contain costs while continuing to do more with more data

Sounds compelling, doesn't it? But how do you know if you are ready for a Data Lake?

Determining Readiness: Some Questions to Ask

Here are some of the critical drivers that indicate readiness:

• Are you working with a growing amount of unstructured data?
• Are your lines of business demanding even more unstructured data?
• Does your organization need a unified view of its information?
• Do you need to perform real-time analysis on source data?
• Is your organization moving toward a culture of democratized data access?
• Would your organization benefit from elasticity of scale?

Design Principles

If you are ready for a Data Lake, there are some key design principles we recommend following. Here are our top five:

• Discovery without limitations
• Low latency at any scale
• Movement from a reactive model to a predictive model
• Elasticity in infrastructure
• Affordability

One of the most significant reasons to build a Data Lake is to encourage experimentation and to move from an intuition-based model to a more comprehensive, empirical, data-science-driven model. For that kind of experimentation and analytical finesse to thrive, you have to allow discovery without limitations - that is, you must be willing and able to give users access to all of that data. You should also be able to perform low-latency queries over data at any scale. Now let's talk about what it takes to build a Data Lake.
Not a Big Bang Approach: The Four Stages of Building a Data Lake

In our experience, building a Data Lake doesn't happen all at once; rather, there are stages of maturity:

1. Handling and ingesting data at scale
2. Building analytical muscle
3. Leveraging the strengths of the EDW and the Data Lake
4. Adopting broadly

1. Handling and Ingesting Data at Scale

The first stage involves getting the architecture in place and learning to acquire and transform data at scale. This is when your organization determines the new and existing data sources it can leverage. These sources are then integrated, and data of high volume and variety is ingested at high velocity into Hadoop storage. At this stage the analytics may be rather simple - perhaps just basic transformations - but it is an important step in discovering how to make Hadoop work the way you want.

2. Building Analytical Muscle

The second stage focuses on improving the ability to transform and analyze data. This is where you begin to really leverage the enterprise Data Lake. For example, your organization can start building batch, mini-batch, and real-time applications for enterprise usage, exploratory analytics, and predictive use cases. Various tools and frameworks come into play at this stage, and the EDW and the Data Lake start working together.

3. Leveraging the Strengths of the EDW and the Data Lake

This is when the orchestra really starts to play. In the third stage, you will want to get data and analytics into the hands of as many people in the organization as possible: democratization begins. This is also the stage at which the EDW and the Hadoop-based Big Data lake truly coexist, allowing the enterprise to leverage the strengths of each architecture.

4. Adopting Broadly

The fourth stage is the highest level of maturity, when enterprise capabilities are added to the Data Lake. Broad adoption of a unified Data Lake architecture requires information governance, compliance, security, auditing, metadata management and information lifecycle management capabilities. Failing to address these issues can slow enterprise adoption and runs the risk that the Data Lake eventually becomes a "data swamp."

The Big Data Lake: Understanding the Essentials

Understanding the layers of the Big Data warehouse is an essential step in the Big Data journey. The following pages describe the components of a Big Data warehouse and the methodology for setting it up.

Components of the Big Data Warehouse

While requirements and specific business needs vary from organization to organization, the following diagram lists the major components of a Big Data warehouse.
Figure 1: Big Data warehouse reference architecture by Impetus (data sources feeding ingestion via Sqoop, Kafka/Flume and existing DI tools; a data query and relational offload engine built on Hive, Pig, Drill or Spark SQL; NoSQL data stores; access via search, pipelines and cubes; plus virtualization, governance, management and business intelligence layers)

Data Sources

An enterprise usually has the following sources of data:

• Relational databases such as Oracle, DB2, PostgreSQL, SQL Server and the like
• Multiple disparate, unstructured and semi-structured data sources, with data in formats such as flat files, XML, JSON or CSV
• Existing systems that provide integration data in EDI or other B2B exchange formats
• Machine data and network elements generating huge volumes of data

Hadoop Distribution

Hadoop is the most popular choice for Big Data today and is available in both open source Apache and commercial distribution packages. Hadoop includes a file system called HDFS (Hadoop Distributed File System), which forms the key data storage layer of the Big Data warehouse. Other storage options are also available, such as GPFS (from IBM) and S3 (from the Amazon cloud).

Data Ingestion

It is imperative to set up reliable and scalable ingestion mechanisms to bring data from the sources into the Hadoop file system (a minimal streaming sketch follows this list):

• For connecting to relational databases, the most popular options are Sqoop and database-specific connectors
• For streaming data, Apache Kafka and Flume are quite popular
• Organizations that need an entire topology of streaming sources, ingestion, in-flight transformation and data persistence would use one of the common CEP (Complex Event Processing) or streaming engines, such as Apache Storm or StreamAnalytix
• Organizations that want to leverage their existing Data Integration (DI) connectors may need custom scripts that integrate using REST, SOAP or JDBC components
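Each of the engines named above has its own API; as a neutral illustration of the same landing pattern, here is a minimal Spark Structured Streaming sketch that continuously writes a Kafka topic into HDFS. The broker address, topic name and paths are hypothetical, and the job assumes the Spark-Kafka connector package is available on the cluster.

```python
from pyspark.sql import SparkSession

# Hypothetical example: continuously land a Kafka topic into the lake.
# Broker addresses, topic and paths are illustrative placeholders.
spark = SparkSession.builder.appName("streaming-ingest-sketch").getOrCreate()

# Subscribe to the raw events topic.
events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092")
          .option("subscribe", "raw-events")
          .load())

# Keep the payload as-is (schema-on-read) and persist it to the lake's
# landing zone, with checkpointing for reliable file output.
query = (events.selectExpr("CAST(value AS STRING) AS payload", "timestamp")
         .writeStream
         .format("parquet")
         .option("path", "hdfs:///lake/landing/raw-events/")
         .option("checkpointLocation", "hdfs:///lake/checkpoints/raw-events/")
         .start())

query.awaitTermination()
```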
Data Query

For data resident in HDFS, there is a multitude of query engines to choose from, including Pig, Hive, Apache Drill and Spark SQL. Many organizations, however, would prefer to re-use the SQL scripts and procedures written for their traditional enterprise data warehouse. Because they have already invested millions of dollars in traditional SQL and PL/SQL engines, it is understandable that organizations want mechanisms that allow them to offload data tables from relational data warehouses to a Big Data warehouse while keeping their querying and reporting scripts intact. Tools and solutions are now available from organizations like Impetus Technologies that help enterprises offload this expensive computing from relational data warehouses to Big Data warehouses without re-writing the entire processing layer. (A minimal sketch of a reused report query appears at the end of this section.)

Data Stores

Along with HDFS, there is a trend to couple a data store or NoSQL database, such as HBase or Cassandra, with the Big Data warehouse. These stores provide additional functions in the form of columnar storage, schema-less storage, richer querying, OLAP/OLTP support and application integration.

Access

With data stored in the HDFS or NoSQL layer, organizations have increasingly complex access requirements. These include features from the traditional world, such as search and cube functions. There are also new tools that help manage complex pipelines of jobs, where the output of one query may be fed as input into another.

Governance

Ensuring data quality is the key reason for data governance in the Big Data warehouse:

• While the aim of the Big Data warehouse is to offer a Data Lake integrated with all enterprise data sources, it is still essential to apply data quality rules so the Data Lake does not turn into a data swamp
• Similarly, data users increasingly need to be able to manage data through its entire lifecycle
• Classifying data by segments such as business user group (for instance, marketing, risk management, or operations) ensures control and governance of the data
• It is also imperative to define enterprise-level information policies to avoid breaches and ensure control over the entire data warehouse
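As promised above, here is a minimal, hedged PySpark sketch of reusing a warehouse-style SQL report against tables offloaded to the lake. The paths, view names and columns are hypothetical illustrations; real offload solutions automate far more than this.

```python
from pyspark.sql import SparkSession

# Hypothetical example: run an existing warehouse-style SQL report
# against tables offloaded to the lake. Names are placeholders.
spark = SparkSession.builder.appName("offload-query-sketch").getOrCreate()

# Register offloaded fact and dimension data as queryable views.
spark.read.parquet("hdfs:///lake/offload/sales_fact/") \
    .createOrReplaceTempView("sales_fact")
spark.read.parquet("hdfs:///lake/offload/store_dim/") \
    .createOrReplaceTempView("store_dim")

# The original reporting SQL can often be reused with little or no change.
report = spark.sql("""
    SELECT d.region,
           SUM(f.amount) AS total_sales,
           COUNT(*)      AS transactions
    FROM sales_fact f
    JOIN store_dim d ON f.store_id = d.store_id
    GROUP BY d.region
    ORDER BY total_sales DESC
""")
report.show()
```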
Virtualization

Organizations have found that, despite their best intentions and use cases, they may have to deal with the coexistence of the enterprise data warehouse and the Big Data warehouse for a period of time. To ensure consistent results with appropriate polyglot querying, federation of data and delivery mechanisms are essential.

Management

To provision and monitor the entire cluster, operations teams need handy tools and dashboards for cluster management. It is not uncommon to find engineers diagnosing the performance of MapReduce jobs and queries in their quest for optimal speed and minimal resource consumption. Security is another key aspect of the warehouse, with authentication and role-based authorization behind defined gateways.

Business Intelligence

The goal of the warehouse is to achieve business insights and generate intelligence for the organization. To achieve that objective, business teams need to be empowered with visualization and reporting tools. Data scientists can also help discover patterns in the data using predictive and statistical algorithms and machine data analytics.

Stages for Setting up a Big Data Warehouse

The journey to a Big Data warehouse is a multi-stage process. It requires selecting the right tools, keeping a clear vision, and following a process to lay out an effective and integrated data warehouse. The key stages broadly include the following.

Stage One: Handle and Ingest Data at Scale

In the first stage, the organization determines the existing and new data sources that it can leverage (a minimal batch-ingestion sketch follows the figure below):

• The data sources are integrated, and the variety of voluminous data is ingested at high velocity into Hadoop storage
• The incoming data may arrive in varied formats, ranging from unstructured, structured and streaming data to machine, geospatial and time-series data, and external data sets such as social media data

Figure 2: Handle and ingest data at scale (streaming, unstructured, structured, machine, geospatial, time-series, external and social sources flowing through landing and ingestion into Big Data storage)
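As a minimal sketch of Stage One's batch side - the kind of parallel relational pull that Sqoop performs - here is a hypothetical PySpark JDBC read into the lake's landing zone. The connection details, table, bounds and paths are placeholders, and the appropriate JDBC driver is assumed to be on the classpath.

```python
from pyspark.sql import SparkSession

# Hypothetical example: a Sqoop-style batch pull of one relational table
# into the lake's landing zone. All connection details are placeholders.
spark = SparkSession.builder.appName("batch-ingest-sketch").getOrCreate()

orders = (spark.read.format("jdbc")
          .option("url", "jdbc:postgresql://dbhost:5432/sales")
          .option("dbtable", "public.orders")
          .option("user", "etl_user")
          .option("password", "etl_password")
          # Split the read into parallel partitions, as Sqoop would.
          .option("partitionColumn", "order_id")
          .option("lowerBound", "1")
          .option("upperBound", "10000000")
          .option("numPartitions", "8")
          .load())

# Land the table in columnar form for downstream processing.
orders.write.mode("append").parquet("hdfs:///lake/landing/orders/")
spark.stop()
```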
Stage Two: Build the Analytical Muscle

To leverage the enterprise Data Lake in Hadoop, the organization builds batch, mini-batch and real-time applications for enterprise usage, exploratory analytics and predictive use cases. Various tools and frameworks are utilized in this stage as organizations begin to:

• Explore advanced querying engines, starting with MapReduce and moving on to Apache Spark, Apache Flink and similar engines for interactive results
• Build use cases for both batch and real-time processing using streaming solutions like Apache Storm and StreamAnalytix
• Build analytic applications for enterprise adoption, exploration, discovery and prediction (see the predictive sketch after the Stage Three discussion below)

Figure 3: Build the analytical muscle (real-time, enterprise, exploration-and-discovery and predictive applications layered over landing, ingestion and Big Data storage, with provisioning, workflow, monitoring and security)

Stage Three: Enterprise Data Warehouse and Big Data Warehouse Work in Unison

In a real-world scenario, the enterprise data warehouse (EDW) and the Hadoop-based Big Data warehouse (BDW) co-exist as follows:

• The organization leverages the data and the specific capabilities of each platform to its advantage
• Rather than disposing of the expensive enterprise warehouse, organizations prefer to use it alongside Big Data technologies
• Once a stable and mature Big Data warehouse is achieved, the EDW and BDW work in unison, distributing workloads and offloading to each other as required
• Specialized solutions, such as the Impetus relational offload solution, help organizations save millions of dollars with superior time and schedule benefits

Figure 4: Enterprise data warehouse and Big Data warehouse work in unison (traditional RDBMS and MPP repositories operating alongside the Big Data storage, ingestion and application layers)
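Returning to Stage Two's predictive use cases, here is a minimal sketch of a first predictive application built on curated lake data, using Spark's MLlib. The path, feature columns and label are hypothetical illustrations; the label is assumed to be a numeric 0/1 flag.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

# Hypothetical example: train a simple churn model on curated lake data.
# Paths and column names are illustrative placeholders.
spark = SparkSession.builder.appName("predictive-sketch").getOrCreate()

customers = spark.read.parquet("hdfs:///lake/curated/customer_features/")

# Assemble numeric feature columns into the single vector column
# that Spark ML estimators expect.
assembler = VectorAssembler(
    inputCols=["tenure_months", "monthly_spend", "support_tickets"],
    outputCol="features")
train = assembler.transform(customers).select("features", "churned")

# Fit a logistic regression as a first predictive use case; "churned"
# is assumed to be a numeric 0/1 label column.
model = LogisticRegression(labelCol="churned").fit(train)
print("Training AUC:", model.summary.areaUnderROC)

spark.stop()
```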
Stage Four: Achieving Enterprise Maturity in the Warehouse

A unified data warehouse needs various enterprise-ready capabilities, particularly information governance, metadata management and information lifecycle management. While organizations may begin with basic governance paradigms, as they mature in the journey it becomes essential to adopt more sophisticated practices and policies. Further, users are no longer satisfied with simply exploring data and managing it through its lifespan; organizations need tools and utilities to handle the laborious tasks of discovering data, proving out insights and managing the information lifecycle.

Figure 5: Achieving enterprise maturity in the warehouse (the Stage Three architecture extended with governance, information lifecycle and enterprise metadata management)
Summary

• Success in the business world today depends on high-quality, accessible information that offers actionable insight
• The emergence of the Data Lake is inspired by the need to manage and exploit new types of data
• Organizations need new architectures capable of integrating both structured and unstructured data, managing massive data sets, and delivering real-time analytics
• Hadoop-based Big Data architectures have changed the face of the data warehouse, business intelligence and analytics world forever
• There is growing acceptance of the Data Lake as a cornerstone of an enterprise Big Data strategy
• Enterprise adoption of Big Data architectures is accelerating as a way to enable broad new opportunities across all functions and industries
• Big Data warehouse architectures will complement, rather than replace, the enterprise data warehouses of today
• As the enterprise data warehouse and the Data Lake work in unison, they provide a synergy of capabilities, ultimately allowing analysts to do more with data and derive business results faster

Conclusion

Each new technological leap arrives amid a buzz of excitement and uncertainty, and Big Data technologies - the Hadoop ecosystem in particular - have captured imaginations across the IT landscape. The journey to a Big Data warehouse, however, requires adequate planning and vision combined with robust engineering and technology practices. Smart, conscientious Data Lake development can drive greater value to and from a company's data, while tapping the incredible power of innovation to drive real insight.

Impetus Technologies specializes in this niche area, has unraveled the mysteries surrounding the Big Data warehouse for many customers, and continues to stay on top of the continuously evolving ecosystem. With products, solutions and services expertise available at every stage of the Big Data journey, enterprise adopters can collaborate with niche providers to weigh the options and chart a visionary path in their industry segment. Big Data provides an unparalleled opportunity to turn information into a competitive asset, yielding revenues for business and glory for IT like never before.

About Impetus

Impetus is focused on creating big business impact through Big Data solutions for Fortune 1000 enterprises across multiple verticals. The company brings together a unique mix of software products, consulting services, Data Science capabilities and technology expertise. It offers full life-cycle services for Big Data implementations and real-time streaming analytics, including technology strategy, solution architecture, proof of concept, production implementation and ongoing support. Visit http://impetus.com or write to us at bigdata@impetus.com

© 2015 Impetus Technologies, Inc. All rights reserved. Product and company names mentioned herein may be trademarks of their respective companies. June 2015