1. A data lake is a storage repository that holds vast amounts of raw data in its native format until it is needed for analysis. It addresses challenges of big data by allowing data to be stored and analyzed together without upfront structuring.
2. Traditional data warehouses structure data upfront, limiting flexibility. A data lake avoids this by storing all data as-is and analyzing data when questions arise. This provides greater analytic power on emerging big data sources.
3. While data lakes provide benefits like reduced costs and more flexibility, challenges remain around metadata management, governance, preparation, and security when storing all raw data in one place. Effective solutions are needed for these challenges to realize the full potential of data lakes.
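The "store as-is, analyze when questions arise" idea in points 1 and 2 is often called schema-on-read. A minimal sketch in plain Python (a hypothetical illustration; the field names and file layout are assumptions, not any product's API):

```python
import json
import os
import tempfile

# Land raw records in the "lake" exactly as they arrive -- no upfront schema.
raw_events = [
    '{"user": "a", "action": "click", "ts": 1}',
    '{"user": "b", "action": "buy", "amount": 9.99, "ts": 2}',  # extra field is fine
    '{"user": "a", "action": "buy", "amount": 4.50, "ts": 3}',
]
lake_dir = tempfile.mkdtemp()
path = os.path.join(lake_dir, "events.jsonl")
with open(path, "w") as f:
    f.write("\n".join(raw_events))

# Schema-on-read: structure is imposed only at analysis time, per question.
def total_spend(lake_file, user):
    total = 0.0
    with open(lake_file) as f:
        for line in f:
            rec = json.loads(line)
            if rec.get("user") == user and rec.get("action") == "buy":
                total += rec.get("amount", 0.0)
    return total

print(total_spend(path, "a"))  # -> 4.5
```

Note that the second record carries a field the first one lacks; nothing breaks at write time, because no schema is enforced until a question is asked.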
Presentation at Data Summit 2015 in NYC.
Elliott Cordo shared real-world insights across a range of topics: the evolving best practices for building a data warehouse on Hadoop that coexists with multiple processing frameworks and additional non-Hadoop storage platforms; the place of massively parallel-processing and relational databases in analytic architectures; and the ways the cloud makes it possible to quickly and cost-effectively establish a scalable platform for your big data warehouse.
For more information, visit www.casertaconcepts.com
Data lakes are early in the Gartner hype cycle, but companies are already getting value from their cloud-based data lake deployments. Break through the confusion between data lakes and data warehouses and seek out the most appropriate use cases for your data lake.
THE FUTURE OF DATA: PROVISIONING ANALYTICS-READY DATA AT SPEED (webwinkelvakdag)
Data lakes and data warehouses, whether on-premises or in the cloud, promise to provide a centralized, cost-effective and scalable foundation for modern analytics. However, organisations continue to struggle to deliver accurate, current and analytics-ready data sets in a timely fashion. Traditional ingestion tools weren’t designed to handle hundreds or even thousands of data sources, and the lack of lineage forces data consumers to manually aggregate information from sources they trust. In this session, you’ll learn how to future-proof your modern data environment to meet the needs of the business for the long term. We'll examine how to overcome common challenges and the must-have technology solutions in the data lake and data warehousing world, using real-world success stories and a few architecture tips from industry experts.
This white paper presents the opportunities offered by the data lake and advanced analytics, as well as the challenges in integrating, mining and analyzing the data collected from these sources. It covers the important characteristics of the data lake architecture and the Data and Analytics as a Service (DAaaS) model, delves into the features of a successful data lake and its optimal design, and shows how data, applications, and analytics are strung together to speed up the insight-generation process, with the data lake as a powerful architecture for mining and analyzing unstructured data.
Speaker: Geetha Balasundaram, Developer at ThoughtWorks
From tools and technology to people and requirements, what's different in the data engineering space? App development is traditional by now, and all enterprises want to become data-guided. A data lake is a good start, yet there is a great deal of know-how and do-how involved.
Drawing on experiences from building a data lake in the retail domain, the talk covers:
- What this vast new space of data engineering is
- Why it is critical to think in terms of data rather than features
- How important it is to understand these technologies and create a data lake that is usable and insightful to the business
The Data Lake and Getting Businesses the Big Data Insights They Need (Dunn Solutions Group)
Do terms like "Data Lake" confuse you? You’re not alone. With all of the technology buzzwords flying around today, it can be hard to keep up with and clearly understand each of them. However, a data lake is definitely something worth taking the time to understand. Leveraging data lake technology, companies are finally able to keep all of their disparate information and streams of data in one secure location, ready for consumption at any time – this includes structured, unstructured, and semi-structured data. For more information on our Big Data Consulting Services, visit us online at: http://bit.ly/2fvV5rR
Joe Caserta, President at Caserta Concepts, presented "Setting Up the Data Lake" at a DAMA Philadelphia Chapter Meeting.
For more information on the services offered by Caserta Concepts, visit our website at http://casertaconcepts.com/.
Incorporating the Data Lake into Your Analytic Architecture (Caserta)
Joe Caserta, President at Caserta Concepts, presented at the 3rd Annual Enterprise DATAVERSITY conference. The emphasis of this year's agenda was on the key strategies and architecture necessary to create a successful, modern data analytics organization.
Joe Caserta presented Incorporating the Data Lake into Your Analytics Architecture.
For more information on the services offered by Caserta Concepts, visit our website at http://casertaconcepts.com/.
Big Data Day LA 2015 - Data Lake - Re Birth of Enterprise Data Thinking by Ra... (Data Con LA)
Why and how the big-data-based enterprise data lake solution, built on NoSQL and SQL technologies, has become significantly more effective at solving enterprise data challenges than its predecessor, the EDW, which tried and failed to solve the same problem entirely with SQL databases.
Building the Enterprise Data Lake: A Look at Architecture (Mark Madsen)
The topic is building an Enterprise Data Lake, discussing high level data and technology architecture. We will describe the architecture of a data warehouse, how a data lake needs to differ, and show a high level functional and data architecture for a data lake. This webinar will cover:
Why dumping data into Hadoop and letting users get it out doesn't work
The difference between a Hadoop application and a Data Lake
Why new ideas about data architecture are a key element
An Enterprise Data Lake reference architecture to frame what must be built
Data Integration Alternatives: When to use Data Virtualization, ETL, and ESB (Denodo)
Data integration is paramount. In this presentation you will find three different paradigms: using client-side tools, creating traditional data warehouses, and the data virtualization solution (the logical data warehouse). The presentation compares them with each other and positions data virtualization as an integral part of any future-proof IT infrastructure.
This presentation is part of the Fast Data Strategy Conference, and you can watch the video here goo.gl/1q94Ka.
Presentation from Data Science Conference 2.0 held in Belgrade, Serbia. The focus of the talk was to address the challenges of deploying a Data Lake infrastructure within the organization.
From Traditional Data Warehouse To Real Time Data Warehouse (Osama Hussein)
Summarising the 'From Traditional Data Warehouse To Real Time Data Warehouse' paper.
1. S. Bouaziz, A. Nabli and F. Gargouri, "From Traditional Data Warehouse To Real Time Data Warehouse", 2017.
This paper describes the concept of a data lake and how it compares to a data warehouse. It reviews recent research and discusses the definition of both repositories: what types of data are catered for? Does ingesting data make it available for forging information, and beyond into knowledge? What types of people, processes and tools need to be involved to realise the benefits of using a data lake?
Caserta Concepts, Datameer and Microsoft shared their combined knowledge and a use case on big data, the cloud and deep analytics. Attendees learned how a global leader in the test, measurement and control systems market reduced their big data implementations from 18 months to just a few.
Speakers shared how to provide a business user-friendly, self-service environment for data discovery and analytics, and focused on how to extend and optimize Hadoop-based analytics, highlighting the advantages and practical applications of deploying on the cloud for enhanced performance, scalability and lower TCO.
Agenda included:
- Pizza and Networking
- Joe Caserta, President, Caserta Concepts - Why are we here?
- Nikhil Kumar, Sr. Solutions Engineer, Datameer - Solution use cases and technical demonstration
- Stefan Groschupf, CEO & Chairman, Datameer - The evolving Hadoop-based analytics trends and the role of cloud computing
- James Serra, Data Platform Solution Architect, Microsoft - Benefits of the Azure Cloud Service
- Q&A, Networking
For more information on Caserta Concepts, visit our website: http://casertaconcepts.com/
Enterprise Data Lake: How to Conquer the Data Deluge and Derive Insights that Matter
Data can be traced from various consumer sources, and managing it is one of the most serious challenges faced by organizations today. Organizations are adopting data lake models because lakes provide raw data that users can use for data experimentation and advanced analytics. A data lake can be a merging point of new and historic data, drawing correlations across all of it using advanced analytics. A data lake can also support self-service data practices, tapping undiscovered business value from new as well as existing data sources. Furthermore, a data lake can aid in modernizing data warehousing, analytics and data integration. However, lakes also face hindrances such as immature governance, user skills and security.
Traditional BI vs. Business Data Lake – A Comparison (Capgemini)
Traditional Business Intelligence (BI) systems provide various levels and kinds of analyses on structured data but they are not designed to handle unstructured data.
For these systems Big Data brings big problems because the data that flows in may be either structured or unstructured. That makes them hugely limited when it comes to delivering Big Data benefits.
The way forward is a complete rethink of the way we use BI - in terms of how the data is ingested, stored and analyzed.
More information: http://www.capgemini.com/big-data-analytics/pivotal
For Impetus’ White Papers archive, visit- http://www.impetus.com/whitepaper
In this paper, Impetus focuses on why organizations need to design an Enterprise Data Warehouse (EDW) to support the business analytics derived from big data.
The main idea of a data lake is to expose the company data in an agile and flexible way to the people within the company, while preserving the safeguard and auditing features that are required for the company’s critical data. Most projects in this direction start out by depositing all of the data in Hadoop, inferring a schema on top of the data, and then using the data for analytics via Hive or Spark. This stack is a really good approach for many use cases, as it provides cheap storage of data in files and rich analytics on top. But many pitfalls and problems can show up on this road, which can be met by extending the toolset; the potential bottlenecks appear as soon as the users arrive and start exploiting the lake. For all of these reasons, planning and building a data lake within an organization requires a strategic approach, in order to build an architecture that can support it.
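The "infer the schema on top of the data" step can be sketched roughly as follows, with plain Python standing in for what Hive or Spark do at far larger scale (the records and field names are invented for illustration):

```python
import json
from collections import defaultdict

# Raw records dumped into the lake, with shapes that drift over time.
raw = [
    '{"id": 1, "name": "ada", "score": 9.5}',
    '{"id": 2, "name": "lin"}',                  # missing field
    '{"id": "3", "name": "bob", "score": 7}',    # conflicting types
]

# Infer a loose schema: for each field, the set of observed value types.
def infer_schema(lines):
    schema = defaultdict(set)
    for line in lines:
        for field, value in json.loads(line).items():
            schema[field].add(type(value).__name__)
    return {f: sorted(t) for f, t in schema.items()}

print(infer_schema(raw))
# -> {'id': ['int', 'str'], 'name': ['str'], 'score': ['float', 'int']}
```

Fields that come back with more than one type (like `id` and `score` here) are exactly the pitfalls the paragraph warns about: they surface only once users start exploiting the lake, which is why governance needs to be planned up front.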
With Polestar, we hope to bring the power of data to organizations across industries, helping them analyze billions of data points and data sets to provide real-time insights, and enabling them to make critical decisions to grow their business.
Data lakes are central repositories that store large volumes of structured, unstructured, and semi-structured data. They are ideal for machine learning use cases and support SQL-based access and programmatic distributed data processing frameworks. Data lakes can store data in the same format as its source systems or transform it before storing it. They support native streaming and are best suited for storing raw data without an intended use case. Data quality and governance practices are crucial to avoid a data swamp. Data lakes enable end-users to leverage insights for improved business performance and enable advanced analytics.
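One common convention behind this "central repository" idea is a zoned, partitioned directory layout, which is also what makes selective access by query engines possible. A minimal sketch (the zone, source, and partition names are assumptions, not a standard):

```python
import os
import tempfile

# A typical lake separates zones (e.g. raw vs curated) and partitions
# each data set by source system and date.
lake = tempfile.mkdtemp()
for source in ("crm", "clickstream"):
    for day in ("2024-01-01", "2024-01-02"):
        part = os.path.join(lake, "raw", f"source={source}", f"date={day}")
        os.makedirs(part)
        with open(os.path.join(part, "part-0000.jsonl"), "w") as f:
            f.write("{}\n")

# Partition pruning: touch only the directories a query actually needs.
def partitions(zone, source=None):
    root = os.path.join(lake, zone)
    hits = []
    for dirpath, _dirs, files in os.walk(root):
        if files and (source is None or f"source={source}" in dirpath):
            hits.append(os.path.relpath(dirpath, lake))
    return sorted(hits)

print(partitions("raw", source="crm"))
```

A SQL engine over the lake exploits the same layout: a filter on `source` or `date` maps to skipping whole directories, which is much of what keeps raw-data storage queryable rather than a swamp.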
Modern Integrated Data Environment - Whitepaper | Qubole (Vasu S)
A white paper about building a modern data platform for data-driven organisations, using a cloud data warehouse within a modern data platform architecture.
https://www.qubole.com/resources/white-papers/modern-integrated-data-environment
Top 60+ Data Warehouse Interview Questions and Answers.pdf (Datacademy.ai)
This is a comprehensive guide to the most frequently asked data warehouse interview questions and answers. It covers a wide range of topics including data warehousing concepts, ETL processes, dimensional modeling, data storage, and more. The guide aims to assist job seekers, students, and professionals in preparing for data warehouse job interviews and exams.
Unlock Your Data for ML & AI using Data Virtualization (Denodo)
How Denodo Complements a Logical Data Lake in the Cloud
● Denodo does not substitute data warehouses, data lakes, ETLs...
● Denodo enables the use of all of them together, plus other data sources:
○ In a logical data warehouse
○ In a logical data lake
○ They are very similar; the only difference is in the main objective
● There are also use cases where Denodo can be used as a data source in an ETL flow
Vikram Andem Big Data Strategy @ IATA Technology Roadmap (IT Strategy Group)
Vikram Andem, Senior Manager, United Airlines, A case for Bigdata Program and Strategy @ IATA Technology Roadmap 2014, October 13th, 2014, Montréal, Canada
Data lakes
DATA LAKES
Big Data Requires a Big, New Architecture
Şaban Dalaman
İstanbul Şehir University, İstanbul, Turkey
sabandalaman@std.sehir.edu.tr
Abstract— What is a data lake? How does it help with the challenges appearing with big data? How is it related to the current enterprise data warehouse? How will the data lake and the enterprise data warehouse be used together? How can you get started on the journey of incorporating a data lake into your architecture?
Index Terms— Apache Hadoop, Data Lake, Big Data
I. INTRODUCTION
The concept of a data lake is emerging as a popular way to
organize and build the next generation of systems to master
new big data challenges. It is not Apache™ Hadoop® but the
power of data that is expanding our view of analytical
ecosystems to integrate existing and new data into what is
called a logical data warehouse. As an important component of
this logical data warehouse, companies are seeking to create
data lakes because they manage and use data with increased
volume, variety, and velocity rarely seen in the past. But what
is a data lake? How does it help with the challenges posed by
big data? How is it related to the current enterprise data
warehouse? How will the data lake and the enterprise data
warehouse be used together? How can you get started on the
journey of incorporating a data lake into your architecture?
RE-THINKING REPOSITORIES[1]
• The massive explosion of sources of information
• How to take maximum advantage of big data?
• In the world of big data, we don’t really know what
value the data has when it’s initially accepted from
the array of sources available to us.
• IT is going to have to press the re-start button on its
architecture for acquiring and understanding
information.
• IT will need to construct a new way of capturing,
organizing and analyzing data.
Big data stands no chance of being useful if people attempt
to process it using the traditional mechanisms of business
intelligence, such as data warehouses and traditional data-
analysis techniques.
II. HISTORY[2]
The term was coined by James Dixon, Pentaho chief
technology officer.
Dixon used the term initially to contrast with "data mart",
which is a smaller repository of interesting attributes extracted
from the raw data.
He says in short "If you think of a datamart as a store of
bottled water – cleansed and packaged and structured for easy
consumption – the data lake is a large body of water in a more
natural state. The contents of the data lake stream in from a
source to fill the lake, and various users of the lake can come
to examine, dive in, or take samples."
Dixon identified two shortcomings of conventional data marts:
"Only a subset of the attributes are examined, so only pre-
determined questions can be answered." and "The data is
aggregated so visibility into the lowest levels is lost."
Therefore, storing data in some “optimal” form for later
analysis doesn’t make any sense. Instead, what Dixon
suggests is storing the data in a massive, easily accessible
repository based on the cheap storage that’s available today.
Then, when there are questions that need answers, that is
the time to organize and sift through the chunks of data that
will provide those answers. Determine the structure of the
data at the time of search, not at the time of storage.
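This schema-on-read idea can be sketched in a few lines of Python; the record shapes and field names below are hypothetical illustrations, not from Dixon's post:

```python
# Land raw records as-is in the "lake": no upfront schema, mixed shapes allowed.
lake = [
    {"type": "order", "id": 1, "amount": 25.0},
    {"type": "clickstream", "user": "u7", "page": "/home"},
    {"type": "order", "id": 2, "amount": 40.0, "coupon": "SPRING"},
]

def query(records, predicate, fields):
    """Apply structure at read time: keep only the fields the question needs."""
    return [{f: r.get(f) for f in fields} for r in records if predicate(r)]

# The structure (which records and fields matter) is decided when the
# question arises, not when the data was stored.
orders = query(lake, lambda r: r.get("type") == "order", ["id", "amount"])
print(orders)
```

Nothing about the stored records constrains the questions asked later: a different predicate and field list yields a different "schema" over the same raw data.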
III. DEFINITION[3]
Wikipedia: A data lake is a large storage repository and
processing engine that provides "massive storage for any kind
of data, enormous processing power and the ability to handle
virtually limitless concurrent tasks or jobs".
Gartner: A data lake is a collection of storage instances of
various data assets additional to the originating data sources.
These assets are stored in a near-exact, or even exact, copy of
the source format.
Techtarget: A data lake is a storage repository that holds a vast
amount of raw data in its native format until it is needed.
Microsoft: Data Lake - Batch, real-time and interactive
analytics made easy.
EMC2: Data Lake Foundation gives you a single system to
capture, store, analyse, protect and manage your data.
Capgemini: Discover a new approach to addressing your
company’s information challenges. Embracing Big Data
satisfies both local and corporate needs from an integrated
environment. We call it the Business Data Lake.
Cognizant: Your mission (whether or not you accept it) is to not
only manage the sheer bulk of data, but to also draw meaning
from the bits and bytes. This requires going way beyond
traditional data repositories to what we call the data lake. You
won't be able to afford the time, effort and cost of loading all
this data into a big data repository, nor could you easily find
and use the data you need in it.
As you can see, there is no generally accepted definition for
Data Lake.
DATA LAKES: It’s a concept, not a place
We may overcome this confusion by setting out the
principles of a data lake.
A data lake is a storage repository that holds a vast amount
of raw data in its native format until it is needed. While a
hierarchical data warehouse stores data in files or folders, a
data lake uses a flat architecture to store data. Each data
element in a lake is assigned a unique identifier and tagged
with a set of extended metadata tags. When a business
question arises, the data lake can be queried for relevant data,
and that smaller set of data can then be analyzed to help
answer the question.
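The mechanics just described (a flat store, a unique identifier per element, extended metadata tags, and querying for a relevant subset) can be sketched as follows; the `DataLake` class and its tag values are illustrative assumptions, not a real product API:

```python
import uuid

class DataLake:
    """Minimal sketch of flat storage: unique IDs plus extended metadata tags."""

    def __init__(self):
        self.objects = {}   # id -> raw payload, kept in its native format
        self.metadata = {}  # id -> set of extended metadata tags

    def put(self, payload, tags):
        oid = str(uuid.uuid4())   # each data element gets a unique identifier
        self.objects[oid] = payload
        self.metadata[oid] = set(tags)
        return oid

    def query(self, *tags):
        """Return the smaller, relevant subset for a business question."""
        wanted = set(tags)
        return [self.objects[oid]
                for oid, t in self.metadata.items() if wanted <= t]

lake = DataLake()
lake.put("raw CRM export", {"crm", "csv"})
lake.put("raw web logs", {"weblogs", "text"})
print(lake.query("crm"))  # only the relevant, tagged element comes back
```

The flat keyspace means no hierarchy has to be designed up front; all organization lives in the metadata tags, which is why tagging and cataloguing (Section IV) are so critical.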
Like big data, the term data lake is sometimes disparaged as
being simply a marketing label for a product that supports
Hadoop. Increasingly, however, the term is being accepted as
a way to describe any large data pool in which the schema and
data requirements are not defined until the data is queried.
The problem is that, in the world of big data, we don’t really
know what value the data has when it’s initially accepted from
the array of sources available to us. We might know some
questions we want to answer, but not to the extent that it
makes sense to close off the ability to answer questions that
materialize later. Therefore, storing data in some “optimal”
form for later analysis doesn’t make any sense. Instead, what
is suggested is storing the data in a massive, easily accessible
repository based on the cheap storage that’s available today.
Then, when there are questions that need answers, that is the
time to organize and sift through the chunks of data that will
provide those answers.
The Business Data Lake changes the way IT looks at
information in a traditional EDW approach. It embraces the
following new principles[9]:
• Land all the information you can, as-is, with no modification
• Encourage LOBs to create point solutions
• Let LOBs decide on the cost/performance for their problem
• Concentrate governance on the critical points only
• Consider the corporate view to be just another LOB view
• Unstructured information is still information
• Never assume the lake contains everything
• Scale is driven by demand – scale down as well as up
These new principles drive a new approach, one that
delivers what IT needs – a cost-effective solution that
leverages the business need for local views.
IV. FOUR CHALLENGES OF DATA LAKES[8]
Meta Data Management
A data lake is only truly valuable to an organization if its data
is tagged and catalogued. Unfortunately, applying the right
metadata at the right moment to the right data within the data
lake can be a challenge.
Data Governance
Data governance is a challenge for any organization dealing
with data in general and big data specifically.
Data Preparation
Raw data must still be cleansed and prepared before it can
be analyzed reliably.
Data Security
With all data held in one central location, security becomes
a pressing concern.
V. BENEFITS OF THE BUSINESS DATA LAKE[8]
A Business Data Lake is a storage area for all data sources.
Data can be pulled/pushed directly from the data sources
into the Storage Area. All data in raw form are available in
one place.
Limitations on the data volumes and storage cost are
significantly reduced through the use of commodity
hardware.
Once all data is brought into the Lake, users can pull
relevant data for analysis. They can analyse and derive
new insights from the data without knowing its initial
structure. APIs that search the data structures in the
Business Data Lake and provide the metadata information
are currently being created. These APIs play a key role in
deriving new insights from ad hoc data analysis.
As new data sources get added to the environment, they
can simply be loaded into the Business Data Lake and a
data refinement/enrichment process created, based on
the business need.
The main drawback of creating a data model up-front is
eliminated. Traditional data modelling, which is done up-
front, fails in a Big Data environment for two reasons: the
nature of the incoming data and the limitation on the
analysis that it allows. The Business Data Lake overcomes
these two limitations by providing a loosely coupled
architecture that enables flexibility of analysis.
Based on repetitive requirements, relevant subject areas
that are used frequently for standard / canned reports can
be loaded into the data warehouse in a dimensional form
and the rest of the data can continue to reside inside the
Business Data Lake for analytics on need.
A data governance framework can be built on top of the
Business Data Lake for relevant enterprise data. This
framework can be extended to additional data based on
requirements.
The Business Data Lake meets local business requirements
as well as enterprise-wide needs from the same data store.
The enterprise view of the data can be considered as
another local view.
Being able to move data across from the sources and turn
it around quickly to derive business outcomes is key to the
success of a Business Data Lake, an area where traditional
BI implementations fail to meet business needs.
VI. ARCHITECTURE COMPARISON – TRADITIONAL BI AND
BUSINESS DATA LAKE
Figure 1. [9]
As Figure 1 shows, a Business Data Lake is able to:
• Receive and store high volume and volatile structured,
semi-structured and unstructured data in near-real time using
low cost commodity hardware
• Provide a platform to perform near-real time analytics and
business processing on the data in the lake
• Provide a business view that is tailored to specific LOBs as
well as to the enterprise.
The Business Data Lake does this in a way which enables
users to reduce the business solution implementation time,
by:
• Eliminating the dependency on up-front data modelling
and thereby letting all data flow in
• Reducing the time taken to build robust ETL processes to load
the data into the structured data stores, which are bound to
change
• Eliminating an over-engineered metadata layer
• Providing the capability to view the same data in different
dimensions and derive new patterns and relationships that lie
within the data.
Figure 2. [9]
Figure 3.[9]
VII. SOME EXAMPLES OF DATA LAKE ARCHITECTURES
Business Data Lake Architecture – Pivotal[6]
Business Data Lake Architecture – Microsoft[5]
Federation Business Data Lake – EMC[11]
Teradata – Hortonworks[13]
As these examples show, the major market players each
offer some kind of data lake architecture. The solutions are
similar in structure but use different products for the
different components of the data lake.
The most important part is the data ingestion solution: here,
vendors must provide a way to store data without losing any
valuable asset.
The next key part of the Business Data Lake is the concept of
distillation. This is where the business creates maps onto the
source data histories contained in the Data Storage tier to
generate the view that matches their current requirements.
The goal here should be to enable the business to extract
any information they are allowed to: privacy and security can
be enforced through the distillation process. These maps can
be reused by others or just discarded, as can the point
information solutions if required.
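A minimal sketch of such a distillation map, with hypothetical record fields and view names (not from the cited architectures):

```python
# Raw source history as landed in the Data Storage tier; fields are invented
# for illustration, including a sensitive one that must not leave the lake.
raw_history = [
    {"user": "u1", "ssn": "123-45-6789", "spend": 10.0},
    {"user": "u2", "ssn": "987-65-4321", "spend": 25.5},
]

def marketing_view(record):
    """Distillation map: privacy is enforced here by dropping sensitive fields."""
    return {"user": record["user"], "spend": record["spend"]}

# Generate the view that matches current requirements; the map itself can be
# reused by other consumers or simply discarded later.
distilled = [marketing_view(r) for r in raw_history]
print(distilled)
```

Because the raw history stays untouched, a different LOB can apply a different map over the same records without any coordination with this one.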
By providing the business with access to all of the raw
information, operational reporting systems can now be
created in the same environment as long-term financial
planning and corporate reporting. Critically, this removes the
business need to create point solutions.
PERSONAL DATA LAKE ARCHITECTURE
Personal Data Lake[12]
We may see a future in which each individual has their own
Personal Data Lake that stores all the digital data accumulated
in their lifetime -- emails, photos, medical records, invoices,
bills, payments, certificates, phone calls, to name just a few
examples. Although it is intuitive to trust an individual to take
care of their own data like they do with their physical
belongings, it requires a fundamental shift in how we
handle data and build the economy on top of it. The figure
illustrates the two different personal data pathways.
The Personal Data Lake research reported in [12] was
initiated late last year. The following points support the
principles discussed here for building such a lake.
• Data privacy and security is at the heart of building a
personal data storage utility to empower personal users with
full control over their data, as well as to benefit the community
(in a tightly controllable manner)
• A data lake is an optimum storage solution for personal
data because of the 3V nature of personal data.
• A successful data lake relies on a successful metadata
management system, as well as on a data
processing/analysis/query system
This project is still at an early stage of implementation;
solutions for personal use can be expected in the near
future.
VIII. CRITICISM
"The main challenge is not creating a data lake, but taking
advantage of the opportunities it presents."[15]
In June 2015 David Needle characterized "so-called data
lakes" as "one of the more controversial ways to
manage big data".[14]
“We see customers creating big data graveyards, dumping
everything into HDFS and hoping to do something with it
down the road. But then they just lose track of what’s
there”.[15]
“Gartner Says Beware of the Data Lake Fallacy”[16]
"A Data Lake is not a data warehouse housed in Hadoop. If
you store data from many systems and join across them, you
have a Water Garden, not a Data Lake."
– James Dixon
REFERENCES
[1] Woods, Dan http://www.forbes.com/sites/ciocentral/2011/07/21/big-
data-requires-a-big-new-architecture/ 2011
[2] Dixon, James https://jamesdixon.wordpress.com/2014/09/25/data-
lakes-revisited/ 2014
[3] Chinnakali, Kumar
http://www.datasciencecentral.com/profiles/blogs/the-collective-
definition-of-data-lake-by-big-data-community, 2015
[4] EMC2, Data Lakes for Big Data,2015
[5] https://azure.microsoft.com/en-us/solutions/data-lake/ 2015
[6] Pivotal, http://www.slideshare.net/capgemini/detection-of-anomalous-
behavior-41986267 2014
[7] Kelly, Thomas, PMP http://www.slideshare.net/ThomasKellyPMP/the-
emerging-data-lake-it-strategy?next_slideshow=1 2014
[8] Rijmenam, Mark van https://datafloq.com/read/Data-Lakes-Open-
Possibilities-Your-Organization/1695 2015
[9] Capgemini, The Principles of the Business Data Lake 2015
[10] Capgemini, Traditional BI vs. Business Data Lake –
A Comparison 2015
[11] EMC2, Federation Business Data Lake – Enabling Comprehensive Data
Services 2015
[12] Alrehamy, Hassan Walker, Coral, Personal Data Lake With Data Gravity
Pull 2015
[13] CITO Research Teradata Hortonworks; Putting the Data Lake to Work
A Guide to Best Practices 2014
[14] Needle, David http://www.eweek.com/enterprise-apps/hadoop-summit-
wrangling-big-data-requires-novel-tools-techniques-2.html 2015
[15] Stein, Brian; Morrison, Alan. Data lakes and the promise of unsiloed data
(Report). Technology Forecast: Rethinking integration.
PricewaterhouseCooper 2014
[16] http://www.gartner.com/newsroom/id/2809117