DATA LAKES
Big Data Requires a Big, New Architecture
Şaban Dalaman
İstanbul Şehir University, İstanbul, Turkey
sabandalaman@std.sehir.edu.tr
Abstract— What is a data lake? How does it help with the
challenges appearing with big data? How is it related to the current
enterprise data warehouse? How will the data lake and the
enterprise data warehouse be used together? How can you get
started on the journey of incorporating a data lake into your
architecture?
Index Terms— Apache Hadoop, Data Lake, Big Data
I. INTRODUCTION
The concept of a data lake is emerging as a popular way to
organize and build the next generation of systems to master
new big data challenges. It is not Apache™ Hadoop® but the
power of data that is expanding our view of analytical
ecosystems to integrate existing and new data into what is called
a logical data warehouse. As an important component of
this logical data warehouse, companies are seeking to create
data lakes because they manage and use data with increased
volume, variety, and a velocity rarely seen in the past. But what
is a data lake? How does it help with the challenges posed by
big data? How is it related to the current enterprise data
warehouse? How will the data lake and the enterprise data
warehouse be used together? How can you get started on the
journey of incorporating a data lake into your architecture?
RE-THINKING REPOSITORIES[1]
• The massive explosion of sources of information
• How to take maximum advantage of big data?
• In the world of big data, we don’t really know what
value the data has when it’s initially accepted from
the array of sources available to us.
• IT is going to have to press the re-start button on its
architecture for acquiring and understanding
information.
• IT will need to construct a new way of capturing,
organizing and analyzing data.
Big data stands no chance of being useful if people attempt
to process it using the traditional mechanisms of business
intelligence, such as data warehouses and traditional
data-analysis techniques.
II. HISTORY[2]
The term "data lake" was coined by James Dixon, chief
technology officer of Pentaho.
Dixon used the term initially to contrast with "data mart",
which is a smaller repository of interesting attributes extracted
from the raw data.
He summarizes the idea as follows: "If you think of a datamart as a store of
bottled water – cleansed and packaged and structured for easy
consumption – the data lake is a large body of water in a more
natural state. The contents of the data lake stream in from a
source to fill the lake, and various users of the lake can come
to examine, dive in, or take samples."
Dixon identified two shortcomings of conventional datamarts:
"Only a subset of the attributes are examined, so only pre-
determined questions can be answered." and "The data is
aggregated so visibility into the lowest levels is lost."
Therefore, storing data in some “optimal” form for later
analysis doesn’t make any sense. Instead, Dixon
suggests storing the data in a massive, easily accessible
repository based on the cheap storage that’s available today.
Then, when there are questions that need answers, that is
the time to organize and sift through the chunks of data that
will provide those answers. Determine the structure of the
data at the time of search, not at the time of storage.
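The idea of deciding the structure at query time rather than at storage time can be illustrated with a minimal Python sketch. It is not taken from Dixon's text: the sample events, the function name read_with_schema and the field names are all assumptions made for illustration. Raw records are kept exactly as they arrived, and a schema (which fields to keep and how to cast them) is supplied only when a question is asked.

```python
import json

# Raw events are stored exactly as they arrived; no schema is enforced on write.
# (Hypothetical sample data.)
RAW_EVENTS = [
    '{"user": "alice", "action": "login", "ts": "2015-06-01T08:00:00"}',
    '{"user": "bob", "action": "purchase", "amount": "19.90", "ts": "2015-06-01T09:30:00"}',
    '{"user": "alice", "action": "purchase", "amount": "5.00", "ts": "2015-06-02T10:15:00"}',
]


def read_with_schema(raw_lines, schema):
    """Apply a schema at read time: keep only the requested fields and cast them.

    `schema` maps a field name to a conversion function. Records that lack a
    requested field are skipped at read time; nothing was rejected at storage time.
    """
    for line in raw_lines:
        record = json.loads(line)
        if all(field in record for field in schema):
            yield {field: cast(record[field]) for field, cast in schema.items()}


# A question that arises later: how much did each user spend?
purchase_schema = {"user": str, "amount": float}
totals = {}
for row in read_with_schema(RAW_EVENTS, purchase_schema):
    totals[row["user"]] = totals.get(row["user"], 0.0) + row["amount"]

print(totals)
```

A different question, such as who logged in and when, would reuse the same raw events with a different schema, which is the flexibility the paragraph above describes.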
III. DEFINITION[3]
Wikipedia: A data lake is a large storage repository and
processing engine. Data lakes provide "massive storage for any kind
of data, enormous processing power and the ability to handle
virtually limitless concurrent tasks or jobs".
Gartner: A data lake is a collection of storage instances of
various data assets additional to the originating data sources.
These assets are stored in a near-exact, or even exact, copy of
the source format.
Techtarget: A data lake is a storage repository that holds a vast
amount of raw data in its native format until it is needed.
Microsoft: Data Lake - Batch, real-time and interactive
analytics made easy.
EMC2: Data Lake Foundation gives you a single system to
capture, store, analyse, protect and manage your data.
Capgemini: Discover a new approach to addressing your
company’s information challenges. Embracing Big Data
satisfies both local and corporate needs from an integrated
environment. We call it the Business Data Lake.
Cognizant: Your mission (whether or not you accept it) is to not
only manage the sheer bulk of data, but to also draw meaning
from the bits and bytes. This requires going way beyond
traditional data repositories to what we call the data lake. You
won't be able to afford the time, effort and cost of loading all
this data into a big data repository, nor could you easily find
and use the data you need in it.
As you can see, there is no generally accepted definition for
Data Lake.
DATA LAKES: It’s a concept, not a place
We may overcome this confusion by stating the principles
of a data lake.
A data lake is a storage repository that holds a vast amount
of raw data in its native format until it is needed. While a
hierarchical data warehouse stores data in files or folders, a
data lake uses a flat architecture to store data. Each data
element in a lake is assigned a unique identifier and tagged
with a set of extended metadata tags. When a business
question arises, the data lake can be queried for relevant data,
and that smaller set of data can then be analyzed to help
answer the question.
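The flat layout described above, with a unique identifier and metadata tags per data element, can be sketched in a few lines of Python. This is a toy in-memory illustration under stated assumptions, not a description of any product: the class name FlatLake, its methods and the tag names are hypothetical.

```python
import uuid


class FlatLake:
    """Toy in-memory lake: one flat namespace, no folder hierarchy."""

    def __init__(self):
        self._objects = {}  # object id -> raw bytes, stored unmodified
        self._tags = {}     # object id -> metadata tags

    def put(self, raw: bytes, **tags) -> str:
        """Store raw data as is and return its unique identifier."""
        object_id = str(uuid.uuid4())
        self._objects[object_id] = raw
        self._tags[object_id] = dict(tags)
        return object_id

    def find(self, **criteria):
        """Return ids of objects whose tags match every given criterion."""
        return [
            oid for oid, tags in self._tags.items()
            if all(tags.get(key) == value for key, value in criteria.items())
        ]

    def get(self, object_id: str) -> bytes:
        return self._objects[object_id]


lake = FlatLake()
lake.put(b"id,total\n1,19.90\n", source="billing", format="csv")
log_id = lake.put(b"GET /index.html 200\n", source="webserver", format="log")
lake.put(b'{"sensor": 7, "temp": 21.5}', source="iot", format="json")

# When a business question arises, query the tags for the relevant subset.
matches = lake.find(source="webserver")
print(lake.get(matches[0]))
```

The raw bytes are never interpreted on the way in; only the tags are consulted to narrow the lake down to the smaller set of data that is then analyzed.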
Like big data, the term data lake is sometimes disparaged as
being simply a marketing label for a product that supports
Hadoop. Increasingly, however, the term is being accepted as
a way to describe any large data pool in which the schema and
data requirements are not defined until the data is queried.
The problem is that, in the world of big data, we don’t really
know what value the data has when it’s initially accepted from
the array of sources available to us. We might know some
questions we want to answer, but not to the extent that it
makes sense to close off the ability to answer questions that
materialize later. Therefore, storing data in some “optimal”
form for later analysis doesn’t make any sense. Instead, the
suggestion is to store the data in a massive, easily accessible
repository based on the cheap storage that’s available today.
Then, when there are questions that need answers, that is the
time to organize and sift through the chunks of data that will
provide those answers.
The Business Data Lake changes the way IT looks at
information in a traditional enterprise data warehouse (EDW)
approach. It embraces the following new principles[9]:
• Land all the information you can, as is, with no modification
• Encourage lines of business (LOBs) to create point solutions
• Let each LOB decide on the cost/performance for its problem
• Concentrate governance on the critical points only
• Consider the corporate view to be just another LOB view
• Unstructured information is still information
• Never assume the lake contains everything
• Scale is driven by demand – scale down as well as up
These new principles drive a new approach, one that
delivers what IT needs – a cost effective solution in a way that
leverages the business need for local views.
IV. FOUR CHALLENGES OF DATA LAKES[8]
• Metadata Management
A data lake is only truly valuable to an organization if its data
is tagged and catalogued. Unfortunately, applying the right
metadata at the right moment to the right data within the data
lake can be a challenge.
• Data Governance
Data governance is a challenge for any organization dealing
with data in general and big data specifically.
• Data Preparation
Data in the lake must be handled and prepared properly.
• Data Security
With all data in one central location, security becomes an
issue.
V. BENEFITS OF THE BUSINESS DATA LAKE[8]
• A Business Data Lake is a storage area for all data sources.
Data can be pulled or pushed directly from the data sources
into the storage area. All data is available in raw form in
one place.
• Limitations on data volumes and storage cost are
significantly reduced through the use of commodity
hardware.
• Once all data is brought into the lake, users can pull
relevant data for analysis. They can analyse and derive
new insights from the data without knowing its initial
structure. APIs that search the data structures in the
Business Data Lake and provide the metadata information
are currently being created. These APIs play a key role in
deriving new insights from ad hoc data analysis.
• As new data sources are added to the environment, they
can simply be loaded into the Business Data Lake, and a
data refinement/enrichment process can be created based on
the business need.
• The main drawback of creating a data model up front is
eliminated. Traditional data modelling, which is done up
front, fails in a Big Data environment for two reasons: the
nature of the incoming data and the limitation on the
analysis that it allows. The Business Data Lake overcomes
these two limitations by providing a loosely coupled
architecture that enables flexibility of analysis.
• Based on repetitive requirements, relevant subject areas
that are used frequently for standard or canned reports can
be loaded into the data warehouse in a dimensional form,
and the rest of the data can continue to reside inside the
Business Data Lake for on-demand analytics.
• A data governance framework can be built on top of the
Business Data Lake for relevant enterprise data. This
framework can be extended to additional data based on
requirements.
• The Business Data Lake meets local business requirements
as well as enterprise-wide needs from the same data store.
The enterprise view of the data can be considered as
another local view.
• Being able to move data across from the sources and turn
it around quickly to derive business outcomes is key to the
success of a Business Data Lake, an area where traditional
BI implementations fail to meet business needs.
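The benefit list mentions APIs that search the data structures in the lake and return metadata, so that users can analyse data without knowing its initial structure. The source does not describe how those APIs work; the following Python sketch only illustrates the general idea under that caveat. The function name describe and the sample records are hypothetical.

```python
import json
from collections import defaultdict


def describe(raw_lines):
    """Infer which fields occur in raw JSON records and with which value types."""
    profile = defaultdict(set)
    for line in raw_lines:
        for field, value in json.loads(line).items():
            profile[field].add(type(value).__name__)
    return {field: sorted(types) for field, types in profile.items()}


# Hypothetical raw records landed without any up-front data model.
raw = [
    '{"id": 1, "name": "pump-a", "temp": 21.5}',
    '{"id": 2, "name": "pump-b", "temp": null}',
    '{"id": "3", "name": "pump-c"}',
]

for field, types in describe(raw).items():
    print(field, types)
```

A profile like this shows an analyst that the id field arrives both as a number and as a string and that temp is sometimes missing, which is the kind of information needed before an ad hoc analysis can start.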
VI. Architecture Comparison — Traditional BI and
Business Data Lake
Figure 1. [9]
As Figure 1 shows, a Business Data Lake is able to:
• Receive and store high volume and volatile structured,
semi-structured and unstructured data in near-real time using
low cost commodity hardware
• Provide a platform to perform near-real time analytics and
business processing on the data in the lake
• Provide a business view that is tailored to specific LOBs as
well as the enterprise.
The Business Data Lake does this in a way which enables
users to reduce the business solution implementation time,
by:
• Eliminating the dependency of data modelling up-front
and thereby letting all data flow in
• Reducing the time taken to build robust ETL processes to load
the data into the structured data stores, which are bound to
change
• Eliminating an over-engineered metadata layer
• Providing the capability to view the same data in different
dimensions and derive new patterns and relationships that lie
within the data.
Figure 2. [9]
Figure 3.[9]
VII. Some examples of Data Lake architectures
Business Data Lake Architecture – Pivotal[6]
Business Data Lake Architecture – Microsoft[5]
Federation Business Data Lake – EMC[11]
Teradata – Hortonworks[13]
As the examples show, the major vendors in the market each offer
some kind of data lake architecture. The architectures are similar
in structure, but each vendor provides different products for the
individual components of the data lake.
The most important part is data ingestion. Here, companies must
be able to store incoming data without losing anything of
value.
The next key part of the Business Data Lake is the concept of
distillation. This is where the business creates maps onto the
source data histories contained in the Data Storage tier to
generate the view that matches their current requirements.
The goal here should be to enable the business to extract
any information they are allowed to: privacy and security can
be enforced through the distillation process. These maps can
be reused by others or just discarded, as can the point
information solutions if required.
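Distillation as described above, a reusable map from raw source data to a business view with privacy enforced along the way, can be sketched in Python. This is an illustrative reading of the concept, not the mechanism of any vendor's product: the function distill, the customer records and the field names are assumptions.

```python
# Raw history as it sits in the storage tier (hypothetical records).
RAW_CUSTOMERS = [
    {"customer_id": 1, "name": "Ada Yilmaz", "email": "ada@example.com",
     "city": "Istanbul", "spend": 120.0},
    {"customer_id": 2, "name": "Can Demir", "email": "can@example.com",
     "city": "Ankara", "spend": 75.5},
    {"customer_id": 3, "name": "Elif Kaya", "email": "elif@example.com",
     "city": "Istanbul", "spend": 30.0},
]


def distill(records, mapping, allowed_fields):
    """Build a view from raw records.

    `mapping` maps a view column to a raw field name. Columns whose raw field
    is not in `allowed_fields` are left out, so the access rule is enforced
    while the view is produced and the raw data stays untouched.
    """
    permitted = {col: src for col, src in mapping.items() if src in allowed_fields}
    return [{col: rec[src] for col, src in permitted.items()} for rec in records]


# A map a marketing team might define; it can be reused by others or discarded.
marketing_map = {"id": "customer_id", "contact": "email",
                 "region": "city", "revenue": "spend"}
marketing_allowed = {"customer_id", "city", "spend"}  # no personal contact data

view = distill(RAW_CUSTOMERS, marketing_map, marketing_allowed)
print(view[0])
```

Because the map is only a description of how to read the raw history, a second team can apply a different map and a different set of allowed fields to the same stored data.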
By providing the business with access to all of the raw
information, operational reporting systems can now be
created in the same environment as long-term financial
planning and corporate reporting. Critically, this removes the
business need to create point solutions.
PERSONAL DATA LAKE ARCHITECTURE
Personal Data Lake[12]
We may see a future in which each individual has their own
Personal Data Lake that stores all the digital data accumulated
in their lifetime -- emails, photos, medical records, invoices,
bills, payments, certificates, phone calls, to name just a few
examples. Although it is intuitive to trust an individual to take
care of their own data like they do with their physical
belongings, it requires a fundamental shift in how we
handle data and build the economy on top of it. The figure
illustrates the two different personal data pathways.
The Personal Data Lake research reported in this paper was
initiated late last year. The following points support the
principles discussed here for building such a lake.
• Data privacy and security is at the heart of building a
personal data storage utility to empower personal users with
full control over their data, as well as to benefit the community
(in a tightly controllable manner).
• A data lake is an optimum storage solution for personal
data because of the 3V (volume, variety, velocity) nature of personal data.
• A successful data lake relies on a successful metadata
management system, as well as on a data
processing/analysis/query system
This project is still at an early stage of implementation. A
solution for personal use may become available in the near
future.
VIII. CRITICISM
• "The main challenge is not creating a data lake, but taking
advantage of the opportunities it presents."[15]
• In June 2015, David Needle characterized "so-called data
lakes" as "one of the more controversial ways to
manage big data".[14]
• “We see customers creating big data graveyards, dumping
everything into HDFS and hoping to do something with it
down the road. But then they just lose track of what’s
there”.[15]
• “Gartner Says Beware of the Data Lake Fallacy”[16]
"Summary: A Data Lake is not a data warehouse housed in
Hadoop. If you store data from many systems and join across
them, you have a Water Garden, not a Data Lake."
James Dixon
REFERENCES
[1] Woods, Dan. http://www.forbes.com/sites/ciocentral/2011/07/21/big-data-requires-a-big-new-architecture/, 2011.
[2] Dixon, James. https://jamesdixon.wordpress.com/2014/09/25/data-lakes-revisited/, 2014.
[3] Chinnakali, Kumar. http://www.datasciencecentral.com/profiles/blogs/the-collective-definition-of-data-lake-by-big-data-community, 2015.
[4] EMC2. Data Lakes for Big Data, 2015.
[5] https://azure.microsoft.com/en-us/solutions/data-lake/, 2015.
[6] Pivotal. http://www.slideshare.net/capgemini/detection-of-anomalous-behavior-41986267, 2014.
[7] Kelly, Thomas, PMP. http://www.slideshare.net/ThomasKellyPMP/the-emerging-data-lake-it-strategy?next_slideshow=1, 2014.
[8] Rijmenam, Mark van. https://datafloq.com/read/Data-Lakes-Open-Possibilities-Your-Organization/1695, 2015.
[9] Capgemini. The Principles of the Business Data Lake, 2015.
[10] Capgemini. Traditional BI vs. Business Data Lake – A Comparison, 2015.
[11] EMC2. Federation Business Data Lake – Enabling Comprehensive Data Services, 2015.
[12] Alrehamy, Hassan; Walker, Coral. Personal Data Lake With Data Gravity Pull, 2015.
[13] CITO Research; Teradata; Hortonworks. Putting the Data Lake to Work: A Guide to Best Practices, 2014.
[14] Needle, David. http://www.eweek.com/enterprise-apps/hadoop-summit-wrangling-big-data-requires-novel-tools-techniques-2.html, 2015.
[15] Stein, Brian; Morrison, Alan. Data lakes and the promise of unsiloed data (Report). Technology Forecast: Rethinking integration. PricewaterhouseCoopers, 2014.
[16] http://www.gartner.com/newsroom/id/2809117

Data Science Jobs and Salaries Analysis.pptxFurkanTasci3
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsappssapnasaifi408
 
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝soniya singh
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFAAndrei Kaleshka
 
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...Suhani Kapoor
 

Recently uploaded (20)

Call Girls in Saket 99530🔝 56974 Escort Service
Call Girls in Saket 99530🔝 56974 Escort ServiceCall Girls in Saket 99530🔝 56974 Escort Service
Call Girls in Saket 99530🔝 56974 Escort Service
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
 
Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...
Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...
Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdf
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdf
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
 
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
 
Call Girls In Mahipalpur O9654467111 Escorts Service
Call Girls In Mahipalpur O9654467111  Escorts ServiceCall Girls In Mahipalpur O9654467111  Escorts Service
Call Girls In Mahipalpur O9654467111 Escorts Service
 
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]
 
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts Service
 
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptxAmazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
 
Data Science Jobs and Salaries Analysis.pptx
Data Science Jobs and Salaries Analysis.pptxData Science Jobs and Salaries Analysis.pptx
Data Science Jobs and Salaries Analysis.pptx
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
 
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFA
 
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
 

Data lakes

DATA LAKES
Big Data Requires a Big, New Architecture
Şaban Dalaman
İstanbul Şehir University, İstanbul, Turkey
sabandalaman@std.sehir.edu.tr

Abstract: What is a data lake? How does it help with the challenges appearing with big data? How is it related to the current enterprise data warehouse? How will the data lake and the enterprise data warehouse be used together? How can you get started on the journey of incorporating a data lake into your architecture?

Index Terms: Apache Hadoop, Data Lake, Big Data

I. INTRODUCTION
The concept of a data lake is emerging as a popular way to organize and build the next generation of systems to master new big data challenges. It is not Apache™ Hadoop® but the power of data that is expanding our view of analytical ecosystems to integrate existing and new data into what is called a logical data warehouse. As an important component of this logical data warehouse, companies are seeking to create data lakes because they manage and use data with increased volume, variety, and a velocity rarely seen in the past. But what is a data lake? How does it help with the challenges posed by big data? How is it related to the current enterprise data warehouse? How will the data lake and the enterprise data warehouse be used together? How can you get started on the journey of incorporating a data lake into your architecture?

RE-THINKING REPOSITORIES[1]
• The massive explosion of sources of information
• How to take maximum advantage of big data?
• In the world of big data, we don't really know what value the data has when it's initially accepted from the array of sources available to us.
• IT is going to have to press the restart button on its architecture for acquiring and understanding information.
• IT will need to construct a new way of capturing, organizing and analyzing data.
Big data stands no chance of being useful if people attempt to process it using the traditional mechanisms of business intelligence, such as data warehouses and traditional data-analysis techniques.

II. HISTORY[2]
The term was coined by James Dixon, Pentaho chief technology officer. Dixon used the term initially to contrast with "data mart", which is a smaller repository of interesting attributes extracted from the raw data. In short, he says: "If you think of a datamart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples."
Dixon identified two shortcomings of conventional datamarts: "Only a subset of the attributes are examined, so only pre-determined questions can be answered." and "The data is aggregated so visibility into the lowest levels is lost."
Therefore, storing data in some "optimal" form for later analysis doesn't make any sense. Instead, what Dixon suggests is storing the data in a massive, easily accessible repository based on the cheap storage that's available today. Then, when there are questions that need answers, that is the time to organize and sift through the chunks of data that will provide those answers. Determine the structure of the data at the time of search, not at the time of storage.

III. DEFINITION[3]
Wikipedia: A data lake is a large storage repository and processing engine. Data lakes provide "massive storage for any kind of data, enormous processing power and the ability to handle virtually limitless concurrent tasks or jobs".
Gartner: A data lake is a collection of storage instances of various data assets additional to the originating data sources. These assets are stored in a near-exact, or even exact, copy of the source format.
Techtarget: A data lake is a storage repository that holds a vast amount of raw data in its native format until it is needed.
Microsoft: Data Lake - Batch, real-time and interactive analytics made easy.
EMC2: Data Lake Foundation gives you a single system to capture, store, analyse, protect and manage your data.
Capgemini: Discover a new approach to addressing your company's information challenges. Embracing Big Data satisfies both local and corporate needs from an integrated environment. We call it the Business Data Lake.
Cognizant: Your mission (whether or not you accept it) is to not only manage the sheer bulk of data, but to also draw meaning from the bits and bytes. This requires going way beyond traditional data repositories to what we call the data lake. You won't be able to afford the time, effort and cost of loading all this data into a big data repository, nor could you easily find and use the data you need in it.
As you can see, there is no generally accepted definition of a data lake.

DATA LAKES: It's a concept, not a place
We may overcome this confusion by stating the principles of a data lake.
A data lake is a storage repository that holds a vast amount of raw data in its native format until it is needed. While a hierarchical data warehouse stores data in files or folders, a data lake uses a flat architecture to store data. Each data element in a lake is assigned a unique identifier and tagged with a set of extended metadata tags. When a business question arises, the data lake can be queried for relevant data, and that smaller set of data can then be analyzed to help answer the question.
Like big data, the term data lake is sometimes disparaged as being simply a marketing label for a product that supports Hadoop. Increasingly, however, the term is being accepted as a way to describe any large data pool in which the schema and data requirements are not defined until the data is queried.
The problem is that, in the world of big data, we don't really know what value the data has when it's initially accepted from the array of sources available to us. We might know some questions we want to answer, but not to the extent that it makes sense to close off the ability to answer questions that materialize later.
Therefore, storing data in some "optimal" form for later analysis doesn't make any sense. Instead, what is suggested is storing the data in a massive, easily accessible repository based on the cheap storage that's available today. Then, when there are questions that need answers, that is the time to organize and sift through the chunks of data that will provide those answers.
The Business Data Lake changes the way IT looks at information in a traditional EDW approach. It embraces the following new principles[9]:
• Land all the information you can as is with no modification
• Encourage LOB to create point solutions
• Let LOB decide on the cost/performance for their problem
• Concentrate governance on the critical points only
• Consider the corporate view to be just another LOB view
• Unstructured information is still information
• Never assume the lake contains everything
• Scale is driven by demands – scale down as well as up
These new principles drive a new approach, one that delivers what IT needs: a cost-effective solution that leverages the business need for local views.
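The flat layout, unique identifiers, metadata tags and query-time structure described in Section III can be illustrated with a small sketch. This is only a toy in-memory stand-in, not any vendor's product: the TinyLake class, its method names, the tag names (source, fmt, subject) and the sample records are all hypothetical and chosen for illustration.

```python
import csv
import io
import json
import uuid


class TinyLake:
    """Hypothetical in-memory stand-in for a data lake (illustration only)."""

    def __init__(self):
        # Flat layout: one mapping keyed by identifier, no folder hierarchy.
        self._objects = {}

    def land(self, raw, **tags):
        """Store the payload as is and return its unique identifier."""
        object_id = str(uuid.uuid4())
        self._objects[object_id] = {"raw": raw, "tags": tags}
        return object_id

    def tags(self, object_id):
        """Return the metadata tags attached to one object."""
        return self._objects[object_id]["tags"]

    def find(self, **wanted):
        """Return ids of objects whose tags match every requested value."""
        return [
            object_id
            for object_id, entry in self._objects.items()
            if all(entry["tags"].get(k) == v for k, v in wanted.items())
        ]

    def read(self, object_id, parser):
        """Schema on read: the caller decides how to interpret the payload."""
        return parser(self._objects[object_id]["raw"])


def parse_json_lines(raw):
    return [json.loads(line) for line in raw.splitlines() if line.strip()]


def parse_csv(raw):
    return list(csv.DictReader(io.StringIO(raw)))


# One parser per native format; structure is imposed only when reading.
PARSERS = {"jsonl": parse_json_lines, "csv": parse_csv}

lake = TinyLake()
lake.land('{"user": "a", "amount": 5}\n{"user": "b", "amount": 7}',
          source="web", fmt="jsonl", subject="orders")
lake.land("user,amount\nc,11\n", source="store", fmt="csv", subject="orders")
lake.land("free text complaint from a customer",
          source="email", fmt="text", subject="support")


def total_order_amount(lake):
    """Answer a question that was not planned when the data was landed."""
    total = 0
    for object_id in lake.find(subject="orders"):
        parser = PARSERS[lake.tags(object_id)["fmt"]]
        for record in lake.read(object_id, parser):
            total += int(record["amount"])
    return total


print(total_order_amount(lake))
```

The three payloads are stored unchanged in their native formats; only the tags are used to select the smaller relevant set, and the structure is applied when the question is asked.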
IV. FOUR CHALLENGES OF DATA LAKES[8]
• Metadata Management: A data lake is only truly valuable to an organization if its data is tagged and catalogued. Unfortunately, applying the right metadata at the right moment to the right data within the data lake can be a challenge.
• Data Governance: Data governance is a challenge for any organization dealing with data in general and big data specifically.
• Data Preparation: The data must be handled and prepared properly.
• Data Security: With all data in one central location, security becomes an issue.

V. BENEFITS OF THE BUSINESS DATA LAKE[8]
• A Business Data Lake is a storage area for all data sources. Data can be pulled/pushed directly from the data sources into the storage area. All data in raw form is available in one place.
• Limitations on data volumes and storage cost are significantly reduced through the use of commodity hardware.
• Once all data is brought into the lake, users can pull relevant data for analysis. They can analyse and derive new insights from the data without knowing its initial structure. APIs that search the data structures in the Business Data Lake and provide the metadata information are currently being created. These APIs play a key role in deriving new insights from ad hoc data analysis.
• As new data sources get added to the environment, they can simply be loaded into the Business Data Lake and a data refinement/enrichment process created, based on the business need.
• The main drawback of creating a data model up-front is eliminated. Traditional data modelling, which is done up-front, fails in a Big Data environment for two reasons: the nature of the incoming data and the limitation on the analysis that it allows. The Business Data Lake overcomes these two limitations by providing a loosely coupled architecture that enables flexibility of analysis.
• Based on repetitive requirements, relevant subject areas that are used frequently for standard/canned reports can be loaded into the data warehouse in a dimensional form, and the rest of the data can continue to reside inside the Business Data Lake for analytics as needed.
• A data governance framework can be built on top of the Business Data Lake for relevant enterprise data. This framework can be extended to additional data based on requirements.
• The Business Data Lake meets local business requirements as well as enterprise-wide needs from the same data store. The enterprise view of the data can be considered as another local view.
• Being able to move data across from the sources and turn it around quickly to derive business outcomes is key to the success of a Business Data Lake, an area where traditional BI implementations fail to meet business needs.
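The split between canned reports in dimensional form and raw data kept for later analysis can be sketched as follows. This is a minimal illustration under assumptions: the event fields (region, month, product, amount, channel), the function names and the sample values are invented, and plain Python dictionaries stand in for both the lake and the warehouse.

```python
from collections import defaultdict

# Hypothetical raw sales events as they might sit in the lake.
raw_events = [
    {"region": "EU", "month": "2015-01", "product": "A", "amount": 10, "channel": "web"},
    {"region": "EU", "month": "2015-01", "product": "B", "amount": 4, "channel": "store"},
    {"region": "US", "month": "2015-01", "product": "A", "amount": 7, "channel": "web"},
    {"region": "EU", "month": "2015-02", "product": "A", "amount": 3, "channel": "web"},
]


def build_sales_summary(events):
    """Aggregate revenue by (region, month) for a frequently used report."""
    summary = defaultdict(int)
    for event in events:
        summary[(event["region"], event["month"])] += event["amount"]
    return dict(summary)


def revenue_by_channel(events):
    """An ad hoc question that the (region, month) summary cannot answer."""
    totals = defaultdict(int)
    for event in events:
        totals[event["channel"]] += event["amount"]
    return dict(totals)


# The summary plays the role of the warehouse table behind a canned report.
sales_summary = build_sales_summary(raw_events)
print(sales_summary)

# The raw events are left untouched, so new questions can still be asked.
print(revenue_by_channel(raw_events))
```

The summary drops the channel attribute, which is the kind of lost detail Dixon describes for aggregated datamarts; because the raw events remain available, the channel question can still be answered.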
VI. ARCHITECTURE COMPARISON: TRADITIONAL BI AND BUSINESS DATA LAKE
Figure 1. [9]
As we see from Figure 1, a Business Data Lake is able to:
• Receive and store high-volume and volatile structured, semi-structured and unstructured data in near-real time using low-cost commodity hardware
• Provide a platform to perform near-real time analytics and business processing on the data in the lake
• Provide a business view that is tailored to specific LOBs as well as the enterprise.
The Business Data Lake does this in a way which enables users to reduce the business solution implementation time, by:
• Eliminating the dependency on up-front data modelling and thereby letting all data flow in
• Reducing the time taken to build robust ETL processes to load the data into the structured data stores, which are bound to change
• Eliminating an over-engineered metadata layer
• Providing the capability to view the same data in different dimensions and derive new patterns and relationships that lie within the data.
Figure 2. [9]
Figure 3. [9]
VII. SOME EXAMPLES OF DATA LAKE ARCHITECTURES
Business Data Lake Architecture – Pivotal[6]
Business Data Lake Architecture – Microsoft[5]
Federation Business Data Lake – EMC[11]
Teradata – Hortonworks[13]
As the examples show, the major players in the market each have some kind of solution for data lake architectures. They are similar in structure but provide different kinds of products for the different components of a data lake. The most important part is the data ingestion solution. Here companies should provide for storing data without losing any valuable asset.
The next key part of the Business Data Lake is the concept of distillation. This is where the business creates maps onto the source data histories contained in the Data Storage tier to generate the view that matches their current requirements. The goal here should be to enable the business to extract any information they are allowed to: privacy and security can be enforced through the distillation process. These maps can be reused by others or just discarded, as can the point information solutions if required.
By providing the business with access to all of the raw information, operational reporting systems can now be created in the same environment as long-term financial planning and corporate reporting. Critically, this removes the business need to create point solutions.

PERSONAL DATA LAKE ARCHITECTURE
Personal Data Lake[12]
We may see a future in which each individual has their own Personal Data Lake that stores all the digital data accumulated in their lifetime: emails, photos, medical records, invoices, bills, payments, certificates and phone calls, to name just a few examples. Although it is intuitive to trust an individual to take care of their own data like they do with their physical belongings, it requires a fundamental shift in how we handle data and build the economy on top of it. The figure illustrates the two different personal data pathways.
The Personal Data Lake research reported in this paper was initiated late last year. The following points support the principles discussed here for building such a lake.
• Data privacy and security are at the heart of building a personal data storage utility to empower personal users with full control over their data, as well as to benefit the community (in a tightly controllable manner)
• A data lake is an optimum storage solution for personal data because of the 3V nature of personal data.
• A successful data lake relies on a successful metadata management system, as well as on a data processing/analysis/query system
This project is still at an early stage of implementation. In the near future we are going to see the solution for personal use.

VIII. CRITICISM
• "The main challenge is not creating a data lake, but taking advantage of the opportunities it presents."[15]
• In June 2015 David Needle characterized "so-called data lakes" as "one of the more controversial ways to manage big data".[14]
• "We see customers creating big data graveyards, dumping everything into HDFS and hoping to do something with it down the road. But then they just lose track of what's there".[15]
• "Gartner Says Beware of the Data Lake Fallacy"[16]
"Summary: A Data Lake is not a data warehouse housed in Hadoop. If you store data from many systems and join across them, you have a Water Garden, not a Data Lake." James Dixon

REFERENCES
[1] Woods, Dan http://www.forbes.com/sites/ciocentral/2011/07/21/big-data-requires-a-big-new-architecture/ 2011
[2] Dixon, James https://jamesdixon.wordpress.com/2014/09/25/data-lakes-revisited/ 2014
[3] Chinnakali, Kumar http://www.datasciencecentral.com/profiles/blogs/the-collective-definition-of-data-lake-by-big-data-community 2015
[4] EMC2, Data Lakes for Big Data 2015
[5] https://azure.microsoft.com/en-us/solutions/data-lake/ 2015
[6] Pivotal, http://www.slideshare.net/capgemini/detection-of-anomalous-behavior-41986267 2014
[7] Kelly, Thomas, PMP http://www.slideshare.net/ThomasKellyPMP/the-emerging-data-lake-it-strategy?next_slideshow=1 2014
[8] Rijmenam, Mark van https://datafloq.com/read/Data-Lakes-Open-Possibilities-Your-Organization/1695 2015
[9] Capgemini, The Principles of the Business Data Lake 2015
[10] Capgemini, Traditional BI vs. Business Data Lake – A Comparison 2015
[11] EMC2, Federation Business Data Lake – Enabling Comprehensive Data Services 2015
[12] Alrehamy, Hassan; Walker, Coral, Personal Data Lake With Data Gravity Pull 2015
[13] CITO Research, Teradata, Hortonworks; Putting the Data Lake to Work: A Guide to Best Practices 2014
[14] Needle, David http://www.eweek.com/enterprise-apps/hadoop-summit-wrangling-big-data-requires-novel-tools-techniques-2.html 2015
[15] Stein, Brian; Morrison, Alan. Data lakes and the promise of unsiloed data (Report). Technology Forecast: Rethinking integration. PricewaterhouseCoopers 2014
[16] http://www.gartner.com/newsroom/id/2809117