DATA LAKES
Big Data Requires a Big, New Architecture
Şaban Dalaman
İstanbul Şehir University, İstanbul, Turkey
sabandalaman@std.sehir.edu.tr
Abstract— What is a data lake? How does it help with the
challenges appearing with big data? How is it related to the current
enterprise data warehouse? How will the data lake and the
enterprise data warehouse be used together? How can you get
started on the journey of incorporating a data lake into your
architecture?
Index Terms— Apache Hadoop, Data Lake, Big Data
I. INTRODUCTION
The concept of a data lake is emerging as a popular way to
organize and build the next generation of systems to master
new big data challenges. It is not Apache™ Hadoop® but the
power of data that is expanding our view of analytical
ecosystems to integrate existing and new data into what is called
a logical data warehouse. As an important component of
this logical data warehouse, companies are seeking to create
data lakes because they manage and use data with increased
volume, variety, and a velocity rarely seen in the past. But what
is a data lake? How does it help with the challenges posed by
big data? How is it related to the current enterprise data
warehouse? How will the data lake and the enterprise data
warehouse be used together? How can you get started on the
journey of incorporating a data lake into your architecture?
RE-THINKING REPOSITORIES[1]
• The massive explosion of sources of information
• How to take maximum advantage of big data?
• In the world of big data, we don’t really know what
value the data has when it’s initially accepted from
the array of sources available to us.
• IT is going to have to press the re-start button on its
architecture for acquiring and understanding
information.
• IT will need to construct a new way of capturing,
organizing and analyzing data.
Big data stands no chance of being useful if people attempt
to process it using the traditional mechanisms of business
intelligence, such as data warehouses and traditional data-analysis
techniques.
II. HISTORY[2]
The term "data lake" was coined by James Dixon, Pentaho chief
technology officer.
Dixon used the term initially to contrast with "data mart",
which is a smaller repository of interesting attributes extracted
from the raw data.
He says in short "If you think of a datamart as a store of
bottled water – cleansed and packaged and structured for easy
consumption – the data lake is a large body of water in a more
natural state. The contents of the data lake stream in from a
source to fill the lake, and various users of the lake can come
to examine, dive in, or take samples."
Dixon identified two shortcomings of conventional datamarts:
"Only a subset of the attributes are examined, so only pre-determined
questions can be answered." and "The data is
aggregated so visibility into the lowest levels is lost."
Therefore, storing data in some “optimal” form for later
analysis doesn’t make any sense. Instead, what Dixon
suggests is storing the data in a massive, easily accessible
repository based on the cheap storage that’s available today.
Then, when there are questions that need answers, that is
the time to organize and sift through the chunks of data that
will provide those answers. Determine the structure of the
data at the time of search, not at the time of storage.
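The idea of deciding on structure at search time is often called schema-on-read. The following is a minimal sketch of it, assuming raw records arrive as JSON lines; the sample records, the `read_with_schema` helper and the field names are hypothetical and serve only as illustration.

```python
import json

# Raw events are stored exactly as they arrived; no schema is imposed at write time.
raw_lake = [
    '{"user": "a", "amount": "12.50", "channel": "web"}',
    '{"user": "b", "amount": "7.00"}',
    '{"user": "a", "amount": "3.25", "channel": "store", "coupon": "X1"}',
]

def read_with_schema(raw_records, schema):
    """Apply a schema (field name -> converter) only when the data is read."""
    for line in raw_records:
        record = json.loads(line)
        # Fields missing from a record become None instead of failing the read.
        yield {
            field: convert(record[field]) if field in record else None
            for field, convert in schema.items()
        }

# Question 1, asked today: total spend per user.
totals = {}
for row in read_with_schema(raw_lake, {"user": str, "amount": float}):
    totals[row["user"]] = totals.get(row["user"], 0.0) + row["amount"]

# Question 2, asked later: which channels appear? Same raw data, different schema.
channels = {row["channel"] for row in read_with_schema(raw_lake, {"channel": str})}
```

Because nothing was discarded or aggregated at storage time, the second question can be answered from the same raw records even though it was not anticipated when the data was stored.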
III. DEFINITION[3]
Wikipedia: A data lake is a large storage repository and
processing engine. Data lakes provide "massive storage for any kind
of data, enormous processing power and the ability to handle
virtually limitless concurrent tasks or jobs".
Gartner: A data lake is a collection of storage instances of
various data assets additional to the originating data sources.
These assets are stored in a near-exact, or even exact, copy of
the source format.
Techtarget: A data lake is a storage repository that holds a vast
amount of raw data in its native format until it is needed.
Microsoft: Data Lake - Batch, real-time and interactive
analytics made easy.
EMC²: Data Lake Foundation gives you a single system to
capture, store, analyse, protect and manage your data.
Capgemini: Discover a new approach to addressing your
company’s information challenges. Embracing Big Data
satisfies both local and corporate needs from an integrated
environment. We call it the Business Data Lake.
Cognizant: Your mission (whether or not you accept it) is to not
only manage the sheer bulk of data, but to also draw meaning
from the bits and bytes. This requires going way beyond
traditional data repositories to what we call the data lake. You
won't be able to afford the time, effort and cost of loading all
this data into a big data repository, nor could you easily find
and use the data you need in it.
As these examples show, there is no generally accepted definition of a
data lake.
DATA LAKES: It’s a concept, not a place
We may overcome this confusion by stating the principles
of a data lake.
A data lake is a storage repository that holds a vast amount
of raw data in its native format until it is needed. While a
hierarchical data warehouse stores data in files or folders, a
data lake uses a flat architecture to store data. Each data
element in a lake is assigned a unique identifier and tagged
with a set of extended metadata tags. When a business
question arises, the data lake can be queried for relevant data,
and that smaller set of data can then be analyzed to help
answer the question.
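The flat layout described above can be sketched as follows. This is a hypothetical in-memory stand-in, not any vendor's API: the `FlatLake` class, its method names and the tag names are assumptions made for illustration.

```python
import uuid

class FlatLake:
    """Hypothetical in-memory stand-in for a flat data lake store."""

    def __init__(self):
        self._objects = {}   # object id -> raw bytes, stored unmodified
        self._tags = {}      # object id -> metadata tags

    def put(self, raw_bytes, **tags):
        """Store raw data under a new unique identifier with its metadata tags."""
        object_id = str(uuid.uuid4())
        self._objects[object_id] = raw_bytes
        self._tags[object_id] = dict(tags)
        return object_id

    def query(self, **wanted):
        """Return ids of objects whose tags match every requested key/value."""
        return [
            object_id
            for object_id, tags in self._tags.items()
            if all(tags.get(key) == value for key, value in wanted.items())
        ]

    def get(self, object_id):
        return self._objects[object_id]

lake = FlatLake()
invoice_id = lake.put(b"<invoice>...</invoice>", source="erp", kind="invoice", year=2015)
lake.put(b"click,page,ts\n", source="web", kind="clickstream", year=2015)
lake.put(b"%PDF-1.4 ...", source="erp", kind="contract", year=2014)

# A business question arrives: select the smaller, relevant set by its tags.
erp_2015 = lake.query(source="erp", year=2015)
```

There are no folders or hierarchy here; the identifier locates an object and the tags are the only way to find it, which is why the metadata has to be applied consistently.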
Like big data, the term data lake is sometimes disparaged as
being simply a marketing label for a product that supports
Hadoop. Increasingly, however, the term is being accepted as
a way to describe any large data pool in which the schema and
data requirements are not defined until the data is queried.
The problem is that, in the world of big data, we don’t really
know what value the data has when it’s initially accepted from
the array of sources available to us. We might know some
questions we want to answer, but not to the extent that it
makes sense to close off the ability to answer questions that
materialize later. Therefore, storing data in some “optimal”
form for later analysis doesn’t make any sense. Instead, what
is suggested is storing the data in a massive, easily accessible
repository based on the cheap storage that’s available today.
Then, when there are questions that need answers, that is the
time to organize and sift through the chunks of data that will
provide those answers.
The Business Data Lake changes the way IT looks at
information in a traditional EDW approach. It embraces the
following new principles[9]:
• Land all the information you can as is, with no modification
• Encourage each line of business (LOB) to create point solutions
• Let the LOB decide on the cost/performance for their problem
• Concentrate governance on the critical points only
• Consider the corporate view to be just another LOB view
• Unstructured information is still information
• Never assume the lake contains everything
• Scale is driven by demands – scale down as well as up
These new principles drive a new approach, one that
delivers what IT needs – a cost effective solution in a way that
leverages the business need for local views.
IV. FOUR CHALLENGES OF DATA LAKES[8]
Metadata Management
A data lake is only truly valuable to an organization if its data
is tagged and catalogued. Unfortunately, applying the right
metadata at the right moment to the right data within the data
lake can be a challenge.
Data Governance
Data governance is a challenge for any organization dealing
with data in general and big data specifically.
Data Preparation
Ensure that the data is properly handled and prepared.
Data Security
With all data held in one central location, security becomes an
issue.
V. BENEFITS OF THE BUSINESS DATA LAKE[8]
• A Business Data Lake is a storage area for all data sources.
Data can be pulled or pushed directly from the data sources
into the storage area, so all data is available in raw form in
one place.
• Limitations on data volumes and storage cost are
significantly reduced through the use of commodity
hardware.
• Once all data is brought into the lake, users can pull
relevant data for analysis. They can analyse and derive
new insights from the data without knowing its initial
structure. APIs that search the data structures in the
Business Data Lake and provide the metadata information
are currently being created. These APIs play a key role in
deriving new insights from ad hoc data analysis.
• As new data sources are added to the environment, they
can simply be loaded into the Business Data Lake, and a
data refinement/enrichment process can be created based on
the business need.
• The main drawback of creating a data model up front is
eliminated. Traditional data modelling, which is done up
front, fails in a Big Data environment for two reasons: the
nature of the incoming data and the limitation on the
analysis that it allows. The Business Data Lake overcomes
these two limitations by providing a loosely coupled
architecture that enables flexibility of analysis.
• Based on recurring requirements, subject areas
that are used frequently for standard (canned) reports can
be loaded into the data warehouse in a dimensional form,
while the rest of the data continues to reside inside the
Business Data Lake for on-demand analytics.
• A data governance framework can be built on top of the
Business Data Lake for relevant enterprise data. This
framework can be extended to additional data based on
requirements.
• The Business Data Lake meets local business requirements
as well as enterprise-wide needs from the same data store.
The enterprise view of the data can be considered as
another local view.
• Being able to move data across from the sources and turn
it around quickly to derive business outcomes is key to the
success of a Business Data Lake, an area where traditional
BI implementations fail to meet business needs.
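One of the benefits listed above is that frequently used subject areas can be loaded into the data warehouse in a dimensional form while the raw data stays in the lake. A minimal sketch of that step follows, assuming a tiny set of raw sales records; the sample data, the surrogate-key scheme and the helper names are hypothetical.

```python
# Raw sales events as they sit in the lake (hypothetical sample data).
raw_sales = [
    {"product": "kettle", "region": "north", "amount": 30.0},
    {"product": "toaster", "region": "north", "amount": 20.0},
    {"product": "kettle", "region": "south", "amount": 30.0},
]

def build_dimension(records, attribute):
    """Assign a surrogate key to each distinct value of one attribute."""
    keys = {}
    for record in records:
        keys.setdefault(record[attribute], len(keys) + 1)
    return keys

product_dim = build_dimension(raw_sales, "product")
region_dim = build_dimension(raw_sales, "region")

# Fact rows reference the dimensions by surrogate key and keep the measure.
sales_fact = [
    (product_dim[r["product"]], region_dim[r["region"]], r["amount"])
    for r in raw_sales
]

def revenue_by(dimension, position):
    """Canned report: total amount per member of one dimension."""
    names = {key: name for name, key in dimension.items()}
    report = {}
    for row in sales_fact:
        name = names[row[position]]
        report[name] = report.get(name, 0.0) + row[2]
    return report

revenue_by_product = revenue_by(product_dim, 0)
```

Only the subject area that is queried repeatedly is modelled this way; `raw_sales` itself is left untouched in the lake, so other questions can still be asked of it later.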
VI. ARCHITECTURE COMPARISON: TRADITIONAL BI AND
BUSINESS DATA LAKE
Figure 1. [9]
As Figure 1 shows, a Business Data Lake is able to:
• Receive and store high volume and volatile structured,
semi-structured and unstructured data in near-real time using
low cost commodity hardware
• Provide a platform to perform near-real time analytics and
business processing on the data in the lake
• Provide a business view that is tailored to specific LOBs as
well as the enterprise.
The Business Data Lake does this in a way which enables
users to reduce the business solution implementation time,
by:
• Eliminating the dependency of data modelling up-front
and thereby letting all data flow in
• Reducing the time taken to build robust ETL processes that load
the data into the structured data stores, which are bound to
change
• Eliminating an over-engineered metadata layer
• Providing the capability to view the same data in different
dimensions and derive new patterns and relationships that lie
within the data.
Figure 2. [9]
Figure 3.[9]
VII. SOME EXAMPLES OF DATA LAKE ARCHITECTURES
Business Data Lake Architecture – Pivotal[6]
Business Data Lake Architecture – Microsoft[5]
Federation Business Data Lake – EMC[11]
Teradata – Hortonworks[13]
As these examples show, the major vendors in the
market each offer some kind of data lake architecture.
The architectures are similar in structure but rely on different
products for the individual components of the data lake.
The most important part is the data ingestion layer. Here
companies should make sure that data is stored without losing any
valuable asset.
The next key part of the Business Data Lake is the concept of
distillation. This is where the business creates maps onto the
source data histories contained in the Data Storage tier to
generate the view that matches their current requirements.
The goal here should be to enable the business to extract
any information they are allowed to: privacy and security can
be enforced through the distillation process. These maps can
be reused by others or just discarded, as can the point
information solutions if required.
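The distillation step described above can be sketched as a reusable map that projects the stored source history onto the view a business unit needs, enforcing access rules along the way. This is only an illustrative sketch: the sample records, the `make_map` helper and the permission sets are hypothetical and are not taken from any of the vendor architectures.

```python
# Source history as held in the storage tier (hypothetical sample records).
source_history = [
    {"patient": "P1", "name": "Ada", "ward": "A", "cost": 120.0},
    {"patient": "P2", "name": "Bob", "ward": "B", "cost": 80.0},
]

def make_map(fields, allowed):
    """Build a reusable map that projects records onto the requested fields,
    refusing any field the caller is not allowed to see."""
    denied = [field for field in fields if field not in allowed]
    if denied:
        raise PermissionError(f"fields not permitted: {denied}")

    def apply(records):
        return [{field: record[field] for field in fields} for record in records]

    return apply

# The finance LOB may see wards and costs, but not personal identifiers.
finance_allowed = {"ward", "cost"}
finance_map = make_map(["ward", "cost"], finance_allowed)
finance_view = finance_map(source_history)
```

The map is an ordinary value, so it can be shared with other users or simply discarded, and the source history is never modified by applying it.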
By providing the business with access to all of the raw
information, operational reporting systems can now be
created in the same environment as long-term financial
planning and corporate reporting. Critically, this removes the
business need to create point solutions.
PERSONAL DATA LAKE ARCHITECTURE
Personal Data Lake[12]
We may see a future in which each individual has their own
Personal Data Lake that stores all the digital data accumulated
in their lifetime -- emails, photos, medical records, invoices,
bills, payments, certificates, phone calls, to name just a few
examples. Although it is intuitive to trust an individual to take
care of their own data like they do with their physical
belongings, it requires a fundamental shift in how we
handle data and build the economy on top of it. The figure
illustrates the two different personal data pathways.
The Personal Data Lake research reported in this paper was
initiated late last year. The following points support the
principles discussed here for building such a lake.
• Data privacy and security are at the heart of building a
personal data storage utility that gives personal users
full control over their data while also benefiting the community
(in a tightly controllable manner).
• A data lake is an optimum storage solution for personal
data because of the 3V (volume, variety, velocity) nature of personal data.
• A successful data lake relies on a successful metadata
management system, as well as on a data
processing/analysis/query system
This project is still at an early stage of implementation. A
solution for personal use is expected in the near
future.
VIII. CRITICISM
"The main challenge is not creating a data lake, but taking
advantage of the opportunities it presents."[15]
In June 2015 David Needle characterized "so-called data
lakes" as "one of the more controversial ways to
manage big data".[14]
“We see customers creating big data graveyards, dumping
everything into HDFS and hoping to do something with it
down the road. But then they just lose track of what’s
there”.[15]
“Gartner Says Beware of the Data Lake Fallacy”[16]
"Summary: A Data Lake is not a data warehouse housed in
Hadoop. If you store data from many systems and join across
them, you have a Water Garden, not a Data Lake. "
James Dixon
REFERENCES
[1] Woods, Dan. http://www.forbes.com/sites/ciocentral/2011/07/21/big-data-requires-a-big-new-architecture/, 2011.
[2] Dixon, James. https://jamesdixon.wordpress.com/2014/09/25/data-lakes-revisited/, 2014.
[3] Chinnakali, Kumar. http://www.datasciencecentral.com/profiles/blogs/the-collective-definition-of-data-lake-by-big-data-community, 2015.
[4] EMC². Data Lakes for Big Data, 2015.
[5] https://azure.microsoft.com/en-us/solutions/data-lake/, 2015.
[6] Pivotal. http://www.slideshare.net/capgemini/detection-of-anomalous-behavior-41986267, 2014.
[7] Kelly, Thomas, PMP. http://www.slideshare.net/ThomasKellyPMP/the-emerging-data-lake-it-strategy?next_slideshow=1, 2014.
[8] Rijmenam, Mark van. https://datafloq.com/read/Data-Lakes-Open-Possibilities-Your-Organization/1695, 2015.
[9] Capgemini. The Principles of the Business Data Lake, 2015.
[10] Capgemini. Traditional BI vs. Business Data Lake – A Comparison, 2015.
[11] EMC². Federation Business Data Lake – Enabling Comprehensive Data Services, 2015.
[12] Alrehamy, Hassan; Walker, Coral. Personal Data Lake With Data Gravity Pull, 2015.
[13] CITO Research, Teradata, Hortonworks. Putting the Data Lake to Work: A Guide to Best Practices, 2014.
[14] Needle, David. http://www.eweek.com/enterprise-apps/hadoop-summit-wrangling-big-data-requires-novel-tools-techniques-2.html, 2015.
[15] Stein, Brian; Morrison, Alan. Data lakes and the promise of unsiloed data (Report). Technology Forecast: Rethinking integration. PricewaterhouseCoopers, 2014.
[16] http://www.gartner.com/newsroom/id/2809117