
Data Lakes versus Data Warehouses


This paper describes the concept of a data lake and how it compares to a data warehouse. It reviews recent research and discusses the definition of both repositories and the types of data each caters for. Does ingesting data make it available for forging information and, beyond that, knowledge? What types of people, processes and tools need to be involved to realise the
benefits of using a data lake?

Published in: Data & Analytics

  1. CA1 Literature Review: Data Lakes
  2. Are Data Lakes the new Data Warehouse? Tom Donoghue v1.0

Can data lakes provide an organisation with a radical approach to harnessing data, discovering information and acquiring knowledge, based on Golfarelli’s Business Intelligence (BI) definition of data, information and knowledge?

Introduction

This paper describes the concept of a data lake and how it compares to a data warehouse. We review recent research and discuss the definition of both repositories and the types of data each caters for. Does ingesting data make it available for forging information and, beyond that, knowledge? What types of people, processes and tools need to be involved to realise the benefits of using a data lake?

Data Lakes and Data Warehouses

Sharma (2016) points out that organisations are facing a barrage of data, generated internally and externally (especially via internet-based platforms). Data generation continues to accelerate, and the breadth of unstructured and semi-structured data is growing in step with it. Current systems and methodologies need to change and adapt to the demands of big data processing. Two areas impacted are the data lake and the data warehouse, which are described below and in Figure 1.

Halter and Kromer (2016) suggest that a data lake provides an alternative way to store high volumes of data in its native format (be that unstructured, semi-structured or structured) at relatively low storage cost. The data schemas are unknown when data is loaded, but are revealed as data in the lake is accessed.

O'Leary (2014) describes a data warehouse as a bolt-on to existing operational systems, consisting of structured data associated with a specific user base and a specific set of predefined business queries. The data schema is predefined and structured to facilitate regular queries. Populating the data warehouse requires multiple extract, transform and load (ETL) processes, which are also designed in advance.
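
The contrast between the two loading models described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample records, the `sales` table and the helper names (`load_warehouse`, `load_lake`, `query_lake`) are hypothetical and do not come from any of the cited papers. An in-memory SQLite table stands in for a warehouse and a plain list of raw strings stands in for a lake.

```python
import json
import sqlite3

# Hypothetical raw events, as they might arrive from different sources.
RAW_EVENTS = [
    '{"customer": "ann", "amount": 12.5, "channel": "web"}',
    '{"customer": "bob", "amount": "7", "device": {"os": "ios"}}',
    '{"comment": "great service", "lang": "en"}',
]


def load_warehouse(events):
    """Schema-on-write: the table is designed in advance, and a small ETL
    step must fit every record to it or reject the record."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sales (customer TEXT NOT NULL, amount REAL NOT NULL)"
    )
    rejected = []
    for line in events:
        record = json.loads(line)
        try:
            # Transform: keep only the predefined columns, coerce the types.
            row = (record["customer"], float(record["amount"]))
        except (KeyError, ValueError, TypeError):
            rejected.append(line)  # does not fit the predefined schema
            continue
        conn.execute("INSERT INTO sales VALUES (?, ?)", row)
    conn.commit()
    return conn, rejected


def load_lake(events):
    """Schema-on-read, load side: records are stored exactly as received."""
    return list(events)


def query_lake(lake, field):
    """Schema-on-read, query side: structure is discovered only when a
    question is asked of the raw records."""
    values = []
    for line in lake:
        record = json.loads(line)
        if field in record:
            values.append(record[field])
    return values


warehouse, rejected = load_warehouse(RAW_EVENTS)
lake = load_lake(RAW_EVENTS)

total = warehouse.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
print(total)                        # sum over the two rows that fit the schema
print(len(rejected))                # records the predefined schema could not hold
print(query_lake(lake, "comment"))  # a question nobody planned for at load time
```

In this sketch the warehouse answers its predefined question efficiently but silently loses the free-text comment, whereas the lake keeps everything and defers the cost of interpreting it to query time.
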
  3. Figure 1: Key aspects of data lakes and data warehouses, based on O'Leary (2014) and Watson (2015).

| Aspect | Data Lake | Data Warehouse |
| Data sources | Many | Few |
| Data types | Unstructured, semi-structured, structured | Structured |
| Schema required on load | No, data loaded without knowledge of data schema | Yes, data schema known prior to load |
| Set-up and configuration | Low implementation cost with open source components. Specialist skills may be scarce | High cost of proprietary software licenses, design, development and maintenance |
| Near real time data | Yes, time between data load and explore is far shorter | Poor, data tends to have historic profile. Data only available once ETL jobs have completed |
| Ad hoc query | Yes, queries authored at run time | No, questions asked in advance, structure must support query. Queries authored at design time |
| Flexible support for cross organisational questions / analysis | Correct approach provides a variety of result sets for a wider and diverse audience | Poor, inflexible predefined structures only support specific demands of a known user base |

Harnessing Data

Drawing on opinion and understanding gained from conference discussions focused on data lakes, Watson (2015) considers that a data lake is sometimes used as a precursor data store. Such a store is capable of ingesting copious amounts of unstructured, semi-structured and structured data whilst retaining the format of the data. This suggests that capturing multiple data types is possible, and ties in with the definition above on data type and raw format preservation. However, it is not clear that amassing data is actually harnessing data. Fitzgerald (2015), in an interview with General Electric covering their experience of an operational data lake, notes that at the point of ingestion the data schema is unknown. How the data will be used in downstream processes, and whether it will add value, is not yet apparent.

Industry case studies conducted by Halter and Kromer (2016) further suggest that the data lake is a viable staging candidate for data warehouse input, for example when processing unstructured real-time data sourced from the internet, data streams and social media.
  4. Discovering Information

Through studies and exploration of the concept of big data, Sharma (2016) suggests that the data lake can provide a rich source of data for rudimentary exploration by skilled data scientists and analysts. Fitzgerald (2015) found that at General Electric 80% of talented data scientists’ time was spent wrangling data into useful information rather than building models for exploring the outcomes. This indicates the importance of correct resource allocation in order to glean information from data whilst keeping costs within acceptable business limits.

In an exploration of industry and academic approaches to BI, data warehousing and big data, O'Leary (2014) discusses the use of Master Data Management to help mitigate common data issues. For instance, data inconsistencies appear when there are multiple data sources, and data redundancy occurs when there are multiple copies of the same data item. Identifying master data and its fitness for purpose provides clarity for the organisation, including the multiple applications which rely on data being consistent. Creating meaningful metadata and attaching it to cleansed lake data assists information discovery. Sharma (2016) suggests that it is plausible to turn a raw data lake into a “smart” data lake through the use of semantic graph models. Adding context to data facilitates awareness and usability, which gives rise to information.

Acquiring Knowledge

Halter and Kromer (2016) consider that a data lake may present an organisation with a competitive advantage. This means being capable of conducting data analytics and forming insights to assist business decision making by acquiring meaning from disparate data sources. From a business perspective, it is worthwhile discussing and forming processes with business decision makers to define what data to populate the lake with in the first place (Watson, 2015).

This in turn provides the scope within which to start the search for information, culminating in knowledge acquisition. Folding big data in with traditional organisational data for modern data analytics requires new forms of technology designed specifically for that purpose. Given the speed, amount and mix of data in this context, existing systems will need to adapt or be replaced. According to Watson (2015), queries required to produce sought-after outcomes may well be searching for data which does not exist in the data warehouse. Similar big data pressure to adapt to change is also recognised by Sharma (2016).

Evaluation

A data lake may not be a panacea for resolving the data issues mentioned above, but it is a technique that could complement the data warehouse. The two have different underlying structural requirements and varied user bases, which call for varied skillsets in order to extract value from both (Halter and Kromer, 2016). However, the temptation of lower entry cost, emerging tool combinations that contextualise data and the expectation of a flexible and usable way to deal with the surge of big data may attract organisations to build data lakes. Figure 2 illustrates possible associations between people, process and tools as part of this evaluation. Examining the suitability of emerging tools in an industry case study, Armstrong and Barnes (2016) suggest that Hadoop is a common tool of choice for data lakes due to its low cost of entry and ability to soak up a wide variety of unstructured and
  5. semi-structured data. Tools such as Hadoop combined with NoSQL (Halter and Kromer, 2016) will facilitate early adopters of data lakes.

Figure 2: People, Process and Tools, based on information compiled from Golfarelli et al. (2004), Watson (2015) and Fitzgerald (2015).

Further research is required into which emerging tools improve data lake access, usability, interrogation and security. Skilled resource cost is a common thread; clear role definition and people management should be examined further to avoid wasteful resource deployment. People are required to maintain and administer Hadoop-based systems, probe the data lake and identify valuable data for input to downstream experimentation, discovery and proof of concept generation. Processes also require attention, including the risk and impact of new legislation and, as suggested by Fitzgerald (2015), a deeper understanding of governance, provenance and how data is managed when at rest or in transit across boundaries.

A large investment already made in existing data warehouse architecture and ETL implementations may preclude the adoption of data lakes. Evidence comparing return on investment for typical data lake and warehouse use cases is an appealing area for further research. However, according to Armstrong and Barnes (2016), as tools in this space evolve, the use of sandboxes and the selective migration of ETL processes into the data lake provide meaningful feedback to support proof of concept efforts.

If the goal is a unified, consolidated master data store which fully supports integrated disparate data and is capable of serving various levels of analytics (e.g. real time, predictive and historical) across the entire organisation, then data lakes could be the first step on that journey. Implementation requires skilled resources that create consistent metadata and data modelling to ensure meaningful outcomes (O'Leary, 2014).

The project requires a business-driven strategy (Halter and Kromer, 2016) and buy-in from senior management to align priorities and to connect the technology road map to defined business objectives (Armstrong and Barnes, 2016).
  6. Bibliography

Armstrong, R. and Barnes, S. (2016), ‘When It's Time to Hadoop’, Business Intelligence Journal, Volume 21, Issue 1, pp. 32-38.
Fitzgerald, M. (2015), ‘Gone Fishing - for Data’, MIT Sloan Management Review, Volume 56, Issue 3, pp. 1-5.
Golfarelli, M., Rizzi, S. and Cella, I. (2004), ‘Beyond data warehousing: what's next in business intelligence?’, in Proceedings of the 7th ACM International Workshop on Data Warehousing and OLAP (DOLAP ’04), Washington, DC, USA, 12-13 November 2004, pp. 1-6.
Halter, O. and Kromer, M. (2016), ‘Dipping a Toe into Data Lake’, Business Intelligence Journal, Volume 21, Issue 2, pp. 40-46.
O'Leary, D. E. (2014), ‘Embedding AI and Crowdsourcing in the Big Data Lake’, IEEE Intelligent Systems, Volume 29, Issue 5, pp. 70-73.
Sharma, S. (2016), ‘Expanded cloud plumes hiding Big Data ecosystem’, Future Generation Computer Systems, Volume 59, pp. 63-92.
Watson, H. J. (2015), ‘Data Lakes, Data Labs, and Sandboxes’, Business Intelligence Journal, Volume 20, Issue 1, pp. 4-7.