Tom Donoghue v1.0
Are Data Lakes the new Data Warehouse?
Can data lakes provide an organisation with a radical approach to harnessing data,
discovering information and acquiring knowledge, based on Golfarelli’s Business
Intelligence (BI) definition of data, information and knowledge?
Introduction
This paper describes the concept of a data lake and how it compares to a data warehouse.
We review recent research and discuss the definition of both repositories. What types of data
are catered for? Does ingesting data make it available for forging information and, beyond
that, knowledge? What types of people, processes and tools need to be involved to realise the
benefits of using a data lake?
Data Lakes and Data Warehouses
Sharma (2016) points out that organisations are facing a barrage of data, generated internally
and externally (especially via internet-based platforms). Data generation continues to
accelerate, and the breadth of unstructured and semi-structured data grows in step with this
acceleration. Current systems and methodologies need to change and adapt to the demands
of big data processing. Two areas impacted are the data lake and the data warehouse, which
are described below and in Figure 1.
Halter et al. (2016) suggest that a data lake provides an alternative way to store high volumes
of data in its native format (be that unstructured, semi-structured or structured) at relatively
low storage costs. The data schemas are unknown when data is loaded, but are revealed as
data in the lake is accessed.
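The schema-on-read behaviour described above can be sketched as follows. This is a hypothetical illustration only: the records, field names and helper function are invented for the example, and no particular data lake product is assumed.

```python
import json

# Records land in the lake in their raw, native format; nothing about
# their structure is declared or validated at load time.
raw_events = [
    '{"user": "alice", "action": "login"}',
    '{"user": "bob", "action": "purchase", "amount": 9.99}',
    '{"sensor": "t1", "reading": 21.4}',
]

def discover_schema(lines):
    """Union of field names observed across raw JSON records.

    The schema is only revealed here, at read time, when the data
    is actually accessed -- not when it was loaded.
    """
    fields = set()
    for line in lines:
        fields.update(json.loads(line).keys())
    return sorted(fields)

print(discover_schema(raw_events))
# the field list emerges from the data itself at access time
```

Note that heterogeneous records (user events and sensor readings) coexist in the same store, which is precisely what a predefined warehouse schema would forbid.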
O'Leary (2014) describes a data warehouse as a bolt-on to existing operational systems,
consisting of structured data associated with a specific user base and a specific set of
predefined business queries. The data schema is predefined and structured to facilitate
regular queries. Populating the data warehouse requires multiple extract, transformation and
load (ETL) processes which are also designed in advance.
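The schema-on-write contrast can be sketched in a few lines. The target schema, column names and rejection policy below are assumptions made purely for illustration, not a description of any specific warehouse product:

```python
# A warehouse-style load: the target schema is fixed in advance,
# and rows that do not conform are rejected at load time rather
# than stored raw for later interpretation.
TARGET_SCHEMA = {"order_id": int, "customer": str, "total": float}

def transform(row):
    """Coerce a raw source row into the predefined warehouse schema."""
    return {col: typ(row[col]) for col, typ in TARGET_SCHEMA.items()}

def load(rows):
    warehouse = []
    for row in rows:
        try:
            warehouse.append(transform(row))
        except (KeyError, ValueError):
            continue  # non-conforming rows are dropped, not retained
    return warehouse

source = [
    {"order_id": "1001", "customer": "acme", "total": "250.00"},
    {"order_id": "1002", "customer": "globex"},  # missing column: rejected
]
print(load(source))
```

The design choice is visible in the `except` branch: a warehouse discards or quarantines what does not fit, whereas a lake would keep the raw row for future exploration.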
| Aspect | Data Lake | Data Warehouse |
|---|---|---|
| Data sources | Many | Few |
| Data types | Unstructured, semi-structured, structured | Structured |
| Schema required on load | No, data loaded without knowledge of data schema | Yes, data schema known prior to load |
| Set-up and configuration | Low implementation cost with open source components; specialist skills may be scarce | High cost of proprietary software licenses, design, development and maintenance |
| Near real time data | Yes, time between data load and explore is far shorter | Poor, data tends to have a historic profile; data only available once ETL jobs have completed |
| Ad hoc query | Yes, queries authored at run time | No, questions asked in advance and structure must support the query; queries authored at design time |
| Flexible support for cross-organisational questions / analysis | Correct approach provides a variety of result sets for a wider and diverse audience | Poor, inflexible predefined structures only support specific demands of a known user base |
Figure 1: Key aspects of data lakes and data warehouses based on O'Leary (2014) and
Watson (2015).
Harnessing Data
Drawing on opinion and understanding gained from conference discussions focused on data
lakes, Watson (2015) considers that a data lake is sometimes used as a precursor data store.
Such a store is capable of ingesting copious amounts of unstructured, semi-structured and
structured data whilst retaining the data’s original format. This suggests that capturing
multiple data types is possible, which ties in with the definition above on data types and
raw-format preservation. However, it is not clear that amassing data is the same as
harnessing it.
In an interview with General Electric covering their experience of an operational data lake,
Fitzgerald (2015) notes that at the point of ingestion the data schema is unknown; how the
data will be used in downstream processes, and whether it will add value, is not yet
apparent. Industry case studies conducted by Halter et al. (2016) further suggest that the
data lake is a viable staging candidate for data warehouse input, for example when
processing unstructured real-time data sourced from the internet, data streams and social
media.
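The staging role described here might be sketched as follows. The message formats, field names and selection rule are assumptions made for the example, not details drawn from the cited case studies:

```python
import json

# The lake ingests everything as-is; a later pass extracts only the
# structured subset suitable for warehouse load.
lake = []

def ingest(raw_message):
    """Retain the message in its native format; no parsing at load time."""
    lake.append(raw_message)

def stage_for_warehouse():
    """Select structured records from the lake for warehouse input."""
    rows = []
    for msg in lake:
        try:
            doc = json.loads(msg)
        except ValueError:
            continue  # free text stays in the lake for later exploration
        if "user" in doc and "text" in doc:
            rows.append({"user": doc["user"], "text": doc["text"]})
    return rows

ingest('{"user": "a", "text": "hello"}')
ingest("not json at all")
print(stage_for_warehouse())  # only the conforming record is staged
```

Nothing is lost at ingestion: the unparseable message remains in the lake, available should a future downstream process find a use for it.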
Discovering Information
Through studies and exploration of the concept of big data, Sharma (2016) suggests that the
data lake can provide a rich source of data for rudimentary exploration by skilled data
scientists and analysts. Fitzgerald (2015) found that 80% of General Electric’s talented data
scientists’ time was spent wrangling data into useful information rather than building
models for exploring the outcomes. This indicates the importance of correct resource
allocation in order to glean information from data whilst keeping costs within acceptable
business limits.
In an exploration of industry and academic approaches to BI, data warehousing and big data,
O'Leary (2014) discusses the use of Master Data Management to help mitigate common data
issues. For instance, data inconsistencies appear due to multiple data sources, and data
redundancy occurs owing to multiple copies of the same data item. Identifying master data
and its fitness for purpose provides clarity for the organisation including the multiple
applications which rely on data to be consistent. Creation of meaningful metadata attached
to cleansed lake data assists information discovery. Sharma (2016) suggests that it is
plausible to turn a raw data lake into a “smart” data lake through the use of semantic graph
models. Adding context to data facilitates awareness and usability, which gives rise to
information.
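One way to picture how metadata turns raw lake data into discoverable information is a minimal catalogue. The paths, tags and helper functions below are hypothetical, chosen only to illustrate the idea of attaching context to stored data:

```python
# A toy metadata catalogue: each lake object is registered with a
# description, tags and its source system, so analysts can discover
# data by meaning rather than by file path.
catalog = {}

def register(path, description, tags, source_system):
    """Attach descriptive metadata to a raw lake object."""
    catalog[path] = {
        "description": description,
        "tags": set(tags),
        "source": source_system,
    }

def find(tag):
    """Discover lake objects by a contextual tag."""
    return [path for path, meta in catalog.items() if tag in meta["tags"]]

register("/lake/raw/clickstream/2016-05", "Web clickstream events",
         ["web", "behaviour"], "cdn-logs")
register("/lake/raw/crm/contacts", "Cleansed customer master data",
         ["customer", "master-data"], "crm")

print(find("customer"))  # ["/lake/raw/crm/contacts"]
```

The same principle, scaled up with semantic graph models as Sharma (2016) suggests, is what distinguishes a “smart” data lake from an unindexed dumping ground.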
Acquiring Knowledge
Halter et al. (2016) suggest that a data lake may present an organisation with a competitive
advantage. This means being capable of conducting data analytics and forming insights to
assist business decision making via the acquisition of meaning from disparate data sources.
Taking a business perspective, it is worthwhile discussing and forming processes with
business decision makers to define what data to populate the lake with in the first place
(Watson, 2015). This in turn provides the scope on which to start the search for information,
culminating in knowledge acquisition.
Folding big data in with traditional organisational data for modern data analytics requires
new forms of technology designed specifically for the task. Given the speed, volume and mix
of data in this context, existing systems will need to adapt or be replaced. According to
Watson (2015), the queries required to produce sought-after outcomes may well be searching
for data which does not exist in the data warehouse. Sharma (2016) recognises a similar big
data pressure to adapt to change.
Evaluation
A data lake may not be a panacea for resolving the data issues mentioned above, but it is a
technique that could complement the data warehouse. The two have different underlying
structural requirements and different user bases, which in turn demand varied skillsets in
order to extract value from each service (Halter et al., 2016). However, the temptation of
lower entry costs, emerging tool combinations that contextualise data, and the expectation
of a flexible and usable way to deal with the surge of big data may attract organisations to
build data lakes. Figure 2 illustrates possible associations between people, process and
tools as part of this evaluation. Examining the suitability of emerging tools in an industry
case study, Armstrong and Barnes (2016) suggest that Hadoop is a common tool of choice
for data lakes due to its low cost of entry and ability to soak up a wide variety of unstructured and
semi-structured data. Tools such as Hadoop combined with NoSQL (Halter et al., 2016) will
facilitate early adopters of data lakes.
Figure 2: People, Process and Tools, based on information compiled from Golfarelli et al.
(2004), Watson (2015) and Fitzgerald (2015).
Further research is required into which emerging tools increase data lake access, usability,
interrogation and security. The cost of skilled resources is a common thread: clear role
definition and people management should be examined further to avoid wasteful resource
deployment. People are required to maintain and administer Hadoop-based systems, probe
the data lake, and identify valuable data for input to downstream experimentation, discovery
and proof-of-concept generation. Processes also require attention, including the risk and
impact of new legislation together with, as suggested by Fitzgerald (2015), a deeper
understanding of governance, provenance and how data is managed at rest or in transit
across boundaries. A large investment already made in existing data warehouse architecture
and ETL implementations may preclude the adoption of data lakes. Evidence comparing
return on investment for typical data lake and warehouse use cases is an appealing area for
further research. However, according to Armstrong and Barnes (2016), as tools in this space
evolve, the use of sandboxes and the selective migration of ETL processes into the data lake
can provide meaningful feedback to support proof-of-concept efforts.
If the goal is a unified, consolidated master data store which fully supports the integration
of disparate data and is capable of serving various levels of analytics (e.g. real-time,
predictive and historical) across the entire organisation, then data lakes could be the first
step on that journey. Implementation requires skilled resources to create consistent
metadata and data models that ensure meaningful outcomes (O'Leary, 2014). The project
requires a business-driven strategy (Halter et al., 2016) and buy-in from senior management
to align priorities and connect the technology road map to defined business objectives
(Armstrong and Barnes, 2016).
Bibliography
Armstrong, R. and Barnes, S. (2016) ‘When It's Time to Hadoop’, Business Intelligence
Journal, Volume 21, Issue 1, pp. 32-38.
Fitzgerald, M. (2015), ‘Gone Fishing - for Data’, MIT Sloan Management Review, Volume 56,
Issue 3, pp. 1-5.
Golfarelli, M., Rizzi, S. and Cella, I. (2004), ‘Beyond data warehousing: what's next in
business intelligence?’, in Proceedings of the 7th ACM International Workshop on Data
Warehousing and OLAP (DOLAP ’04), Washington, DC, USA, 12-13 November 2004, pp.
1-6.
Halter, O. and Kromer, M. (2016), ‘Dipping a Toe into Data Lake’, Business Intelligence
Journal, Volume 21, Issue 2, pp. 40-46.
O'Leary, D. E. (2014), ‘Embedding AI and Crowdsourcing in the Big Data Lake’, IEEE
Intelligent Systems, Volume 29, Issue 5, pp. 70-73.
Sharma, S. (2016), ‘Expanded cloud plumes hiding Big Data ecosystem’, Future Generation
Computer Systems, Volume 59, pp. 63-92.
Watson, H. J. (2015), ‘Data Lakes, Data Labs, and Sandboxes’, Business Intelligence
Journal, Volume 20, Issue 1, pp. 4-7.