Are Data Lakes the new Data Warehouse?
Tom Donoghue v1.0 Page 1
Can data lakes provide an organisation with a radical approach to harnessing data,
discovering information and acquiring knowledge, based on Golfarelli’s Business
Intelligence (BI) definition of data, information and knowledge?
Introduction
This paper describes the concept of a data lake and how it compares to a data warehouse.
We review recent research and discuss the definition of both repositories: what types of data
are catered for? Does ingesting data make it available for forging information and, beyond
that, knowledge? What types of people, processes and tools need to be involved to realise
the benefits of using a data lake?
Data Lakes and Data Warehouses
Sharma (2016) points out that organisations are facing a barrage of data, generated internally
and externally (especially via internet-based platforms). Data generation continues to
accelerate, and the breadth of unstructured and semi-structured data grows in step with this
acceleration. Current systems and methodologies need to change and adapt to the demands
of big data processing. Two repositories shaped by these demands, the data lake and the
data warehouse, are described below and in Figure 1.
Halter et al. (2016) suggest that a data lake provides an alternative way to store high volumes
of data in its native format (be that unstructured, semi-structured or structured) at relatively
low storage costs. The data schemas are unknown when data is loaded, but are revealed as
data in the lake is accessed.
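This schema-on-read behaviour, where schemas are unknown at load time and only revealed on access, can be illustrated with a minimal sketch in plain Python. The JSON records and field names below are invented for illustration and stand in for a real lake store such as HDFS:

```python
import json

# Raw events land in the lake in native form; no schema is declared up front.
raw_lake = [
    '{"user": "a1", "action": "click", "page": "/home"}',
    '{"user": "b2", "action": "buy", "sku": "X9", "amount": 19.99}',
    '{"sensor": "t-04", "temp_c": 21.5}',
]

def read_with_schema(raw_records):
    """Schema-on-read: parse each record and derive the set of fields
    (and their types) only at access time."""
    parsed = [json.loads(r) for r in raw_records]
    schema = {}
    for rec in parsed:
        for field, value in rec.items():
            schema.setdefault(field, type(value).__name__)
    return parsed, schema

records, schema = read_with_schema(raw_lake)
# The derived schema reflects whatever actually arrived,
# e.g. 'amount' maps to 'float' and 'sensor' appears only because one
# record happened to carry it.
```

Nothing is rejected at load time; the cost of interpreting the data is deferred to whoever reads it.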
O'Leary (2014) describes a data warehouse as a bolt-on to existing operational systems,
consisting of structured data associated with a specific user base and a specific set of
predefined business queries. The data schema is predefined and structured to facilitate
regular queries. Populating the data warehouse requires multiple extract, transform and
load (ETL) processes which are also designed in advance.
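By contrast, the warehouse's predefined-schema, ETL-driven load can be sketched as follows. This is a minimal illustration using Python's built-in sqlite3; the table, source rows and transform are invented:

```python
import sqlite3

# The warehouse schema is designed in advance of any load (schema-on-write).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sales_fact (sale_date TEXT, region TEXT, amount REAL)")

# Extract: rows from a hypothetical operational system.
source_rows = [
    ("2016-03-01", "emea", "120.50"),
    ("2016-03-01", "apac", "80.00"),
]

# Transform: coerce each row into the predefined shape and types.
transformed = [(d, r.upper(), float(a)) for d, r, a in source_rows]

# Load: rows that did not fit the schema would fail here, not at query time.
db.executemany("INSERT INTO sales_fact VALUES (?, ?, ?)", transformed)

total = db.execute("SELECT SUM(amount) FROM sales_fact").fetchone()[0]
```

The predefined queries the warehouse serves are cheap precisely because this shaping work was paid for up front, at design and load time.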
Aspect | Data Lake | Data Warehouse
Data sources | Many | Few
Data types | Unstructured, semi-structured, structured | Structured
Schema required on load | No, data loaded without knowledge of data schema | Yes, data schema known prior to load
Set-up and configuration | Low implementation cost with open source components; specialist skills may be scarce | High cost of proprietary software licenses, design, development and maintenance
Near real time data | Yes, time between data load and explore is far shorter | Poor, data tends to have a historic profile; data only available once ETL jobs have completed
Ad hoc query | Yes, queries authored at run time | No, questions asked in advance and structure must support the query; queries authored at design time
Flexible support for cross-organisational questions / analysis | Correct approach provides a variety of result sets for a wider and diverse audience | Poor, inflexible predefined structures only support specific demands of a known user base
Figure 1: Key aspects of data lakes and data warehouses based on O'Leary (2014) and
Watson (2015).
Harnessing Data
Taking opinion and understanding gained from conference discussions focused on data
lakes, Watson (2015) considers that a data lake is sometimes used as a precursor data store.
Such a store is capable of ingesting copious amounts of unstructured, semi-structured and
structured data, whilst the format of the data is retained. The above suggests that multiple
data type capture is possible, and ties in with the definition above on data type and raw
format preservation. However, it is not clear that amassing data is actually harnessing data.
Fitzgerald (2015), in an interview with General Electric covering their experience of an
operational data lake, notes that at the point of ingestion the data schema is unknown. How
data will be used in downstream processes, and whether it will add value, is not yet
apparent. Industry case studies conducted by Halter et al. (2016) further suggest that the
data lake is a viable staging candidate for data warehouse input, for example, when
processing unstructured real time data sourced from the internet, data streams and social
media.
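One way to picture the lake as a staging area is a raw zone that preserves each record in its native format, tagged only with partition metadata so downstream warehouse ETL can pick it up later. The sketch below is illustrative; the directory layout, source names and payloads are invented:

```python
import pathlib
import tempfile

# A throwaway directory stands in for the lake's raw zone.
lake_root = pathlib.Path(tempfile.mkdtemp())

def ingest(source: str, payload: str, day: str) -> pathlib.Path:
    """Land a record in the raw zone unchanged, partitioned by source and day.
    No parsing happens here: the native format is preserved for later use."""
    zone = lake_root / "raw" / source / day
    zone.mkdir(parents=True, exist_ok=True)
    path = zone / f"part-{len(list(zone.iterdir())):05d}.txt"
    path.write_text(payload)
    return path

# Unstructured and semi-structured records land side by side, untouched.
ingest("social", '{"post": "great product!", "likes": 3}', "2016-03-01")
ingest("social", "free-text complaint with no structure at all", "2016-03-01")
ingest("stream", '{"sensor": "t-04", "temp_c": 21.5}', "2016-03-01")

staged = sorted(p.relative_to(lake_root).as_posix()
                for p in lake_root.rglob("*.txt"))
```

Because nothing is transformed on the way in, the same staged files can later feed warehouse ETL, ad hoc exploration, or both.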
Discovering Information
Through studies and exploration of the concept of big data, Sharma (2016) observes that the
data lake can provide a rich source of data for rudimentary exploration by skilled data
scientists and analysts. Fitzgerald (2015) found that General Electric's talented data
scientists spent 80% of their time wrangling data into useful information rather than
building models for exploring the outcomes. This indicates the importance of correct
resource allocation in order to glean information from data whilst keeping costs within
acceptable business limits.
In an exploration of industry and academic approaches to BI, data warehousing and big data,
O'Leary (2014) discusses the use of Master Data Management to help mitigate common data
issues. For instance, data inconsistencies appear due to multiple data sources and data
redundancy occurs owing to multiple copies of the same data item. Identifying master data
and its fitness for purpose provides clarity for the organisation, including the multiple
applications that rely on consistent data. Creation of meaningful metadata attached
to cleansed lake data assists information discovery. Sharma (2016) suggests that it is
plausible to turn a raw data lake into a “smart” data lake through the use of semantic graph
models. Adding context to data facilitates awareness and usability, which gives rise to
information.
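The point about metadata turning a raw lake into a usable one can be sketched as a minimal catalogue: discovery works by querying descriptions rather than scanning raw files. The dataset names, owners and tags below are invented for illustration:

```python
# A minimal data catalogue: each lake dataset gets a metadata record so
# users can discover data by meaning rather than by file path.
catalog = {}

def register(name, owner, tags, source):
    """Attach descriptive metadata to a dataset held in the lake."""
    catalog[name] = {"owner": owner, "tags": set(tags), "source": source}

def discover(tag):
    """Return dataset names whose metadata carries the given tag."""
    return sorted(n for n, meta in catalog.items() if tag in meta["tags"])

register("web_clicks_raw", "marketing", ["clickstream", "web"], "site logs")
register("crm_customers", "sales", ["customer", "master-data"], "CRM export")
register("social_posts_raw", "marketing", ["social", "web"], "social media feed")

web_sets = discover("web")
```

A real implementation would persist this alongside lineage and quality measures, but even this toy version shows how context added to data gives rise to information.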
Acquiring Knowledge
Halter et al. (2016) suggest that a data lake may present an organisation with a competitive
advantage. This means being capable of conducting data analytics and forming insights to
assist business decision making via the acquisition of meaning from disparate data sources.
Taking a business perspective, it is worthwhile discussing and forming processes with
business decision makers to define what data to populate the lake with in the first place
(Watson, 2015). This in turn provides the scope on which to start the search for information,
culminating in knowledge acquisition.
Folding big data in with traditional organisational data for modern data analytics requires the
use of new forms of technology designed specifically to bring about the desired results. Given
the speed, amount and mix of data in this context, existing systems will need to adapt or
be replaced. Queries required to produce sought-after outcomes may well be searching for
data which does not exist in the data warehouse, according to Watson (2015). Similar big data
pressure to adapt to change is also recognised by Sharma (2016).
Evaluation
A data lake may not be a panacea for resolving the data issues mentioned above, but it is a
technique that could complement the data warehouse. The two have different underlying
structural requirements and diverse user bases, which demand a broad skillset in order to
extract value from both services (Halter et al., 2016). However, the temptation of lower entry
cost, emerging tool combinations that contextualise data and the expectation of a flexible
and usable way to deal with the surge of big data may attract organisations to build data
lakes. Figure 2 illustrates possible associations between people, process and tools as part
of this evaluation. Examining the suitability of emerging tools in an industry case study,
Armstrong and Barnes (2016) suggest that Hadoop is a common tool of choice for data
lakes due to its low cost of entry and ability to soak up a wide variety of unstructured and
semi-structured data. Tools such as Hadoop combined with NoSQL (Halter et al., 2016) will
facilitate early adopters of data lakes.
Figure 2: People, Process and Tools, based on information compiled from Golfarelli (2004);
Watson (2015) and Fitzgerald (2015).
Further research is required around which emerging tools increase data lake access,
usability, interrogation and security. Skilled resource cost is a common thread, clear role
definition and people management should be examined further to avoid wasteful resource
deployment. People are required to maintain and administer Hadoop based systems, probe
the data lake, identify valuable data for input to downstream experimentation, discovery and
proof of concept generation. Processes also require attention: the risk and impact of new
legislation, together with, as suggested by Fitzgerald (2015), gaining a deeper understanding
of governance, provenance and how data is managed when at rest or in transit across
boundaries. A large investment already made in existing data warehouse architecture and
ETL implementations may preclude the adoption of data lakes. Evidence comparing return
on investment for typical data lake and warehouse use cases is an appealing area for further
research. However, according to Armstrong and Barnes (2016), as tools in this space evolve,
use of sandboxes and selective migration of ETL processes into the data lake provide
meaningful feedback to support proof of concept efforts.
If the goal is a unified, consolidated master data store that fully integrates disparate
data and is capable of serving various levels of analytics (e.g. real time, predictive and
historical) across the entire organisation, then data lakes could be the first step on that
journey. Their implementation requires skilled resources to create consistent metadata and
data modelling to ensure meaningful outcomes (O'Leary, 2014). The project requires a
business driven strategy (Halter et al., 2016), buy-in by senior management to align priorities
and to connect the technology road map to defined business objectives (Armstrong and
Barnes, 2016).
Bibliography
Armstrong, R. and Barnes, S. (2016) ‘When It's Time to Hadoop’, Business Intelligence
Journal, Volume 21, Issue 1, pp. 32-38.
Fitzgerald, M. (2015), ‘Gone Fishing - for Data’, MIT Sloan Management Review, Volume 56,
Issue 3, pp. 1-5.
Golfarelli, M., Rizzi, S. and Cella, I. (2004), ‘Beyond data warehousing: what's next in
business intelligence?’, in Proceedings of the 7th ACM International Workshop on Data
Warehousing and OLAP, DOLAP ’04, Washington, DC, USA, 12-13 November 2004, pp.
1-6.
Halter, O. and Kromer, M. (2016), ‘Dipping a Toe into Data Lake’, Business Intelligence
Journal, Volume 21, Issue 2, pp. 40-46.
O'Leary, D. E. (2014), ‘Embedding AI and Crowdsourcing in the Big Data Lake’, IEEE
Intelligent Systems, Volume 29, Issue 5, pp. 70-73.
Sharma, S. (2016), ‘Expanded cloud plumes hiding Big Data ecosystem’, Future Generation
Computer Systems, Volume 59, pp. 63-92.
Watson, H. J. (2015), ‘Data Lakes, Data Labs, and Sandboxes’, Business Intelligence
Journal, Volume 20, Issue 1, pp. 4-7.