5 STEPS FOR ARCHITECTING A DATA LAKE
How to maximize intelligence by unifying enterprise data
© 2018 MetroStar Systems, Inc. - All Rights Reserved
TABLE OF CONTENTS
SECTION 1: INTRODUCTION
    Data Growth Challenges
SECTION 2: WHAT IS A DATA LAKE?
    How Does a Data Lake Work?
    Data Lake vs Traditional Approach
SECTION 3: DATA LAKE REQUIREMENTS
    Creating a Successful Data Lake
    Data Lake Governance
    Selecting the Right Platform
SECTION 4: 5 STEPS FOR ARCHITECTING A DATA LAKE
    1. Ingestion & Storage
    2. Data Processing
    3. Robust Data Governance
    4. Data Retrieval and Visualization
    5. Advanced Analytics
    Overview of a Data Lake’s Capabilities
SECTION 5: MAXIMIZING THE VALUE OF A DATA LAKE
    Data Revolves Around Citizens
    Enhancing Citizen Experience
ASSESSING READINESS
SECTION 1:
INTRODUCTION
DATA GROWTH CHALLENGES
Data Growth Challenges:
• High overhead costs due to inflexible architecture and legacy technology maintenance
• Antiquated data environments that suffer from poor master data management practices
• Low data integrity due to a lack of a single source of truth with respect to the data
• Inability to provide internal users, analysts, developers, and management the tools needed to perform their respective roles at the high caliber of quality expected from today’s workplace
Enterprises that do not employ Data Lake platforms can find themselves being outpaced by the rate of their agency’s data growth.
AS AN AGENCY GROWS, SO DOES ITS DATA. Data is no longer limited to structured, relational, or transactional records. It now includes semi-structured and unstructured data, operational logs, social media, free text, and more. The ability to ingest data of all varieties is imperative to gaining a holistic understanding of the digital ecosystem. Agencies can combine cutting-edge technologies with wide-ranging, high-integrity data sources to derive powerful insights into their operational and theoretical questions. By coupling the robust technologies of a Data Lake with the flexible, cost-effective capabilities of a Cloud Service Provider (CSP) such as Amazon Web Services (AWS) or Microsoft Azure, the Data Lake becomes a powerful asset for agencies large and small.
Source: http://infosysblogs.com/brandededge/2013/04/20130419infographic.html
SECTION 2:
WHAT IS A DATA LAKE?
HOW DOES A DATA LAKE WORK?
A Data Lake is a natural maturation of data migrating to a single environment. The Data Lake provides capabilities seldom seen in IT enterprises that employ disparate data stores and databases.
“A data lake is like a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples.”
– James Dixon, CTO, Pentaho
DATA LAKE vs TRADITIONAL APPROACH
Data Storage
    DATA LAKE: Structured, semi-structured, or unstructured data can be stored at low costs and can be stored with a schema (e.g. relational) or can be schema-less.
    TRADITIONAL: Data is stored in vertically scaling relational database management systems (RDBMS) at high costs.
Advanced Analytics
    DATA LAKE: Analytics can be run on any and all data sets in real-time (e.g. in memory machine learning algorithms) without requiring upfront manual processing or preparation.
    TRADITIONAL: Data typically has to be manually prepared and integrated from multiple sources, which can be a significant barrier to generating rapid insights.
Enterprise Data Taxonomy
    DATA LAKE: Multiple taxonomies, schemas, and standards can exist in a single data environment while being applied by different data stakeholder groups.
    TRADITIONAL: Agencies struggled in the past to create a single taxonomy or schema to represent the enterprise data model.
User Access Control
    DATA LAKE: Data is tagged at ingestion (and automatically analyzed on read) with the appropriate authorization rules. Authentication can be controlled through single sign on (SSO) capabilities.
    TRADITIONAL: Data authentication and authorization is specified using manually-controlled and disparate tools (e.g. Access Control Lists).
Business Intelligence
    DATA LAKE: Information and analytics are conveyed using automated, feature-rich, dashboards and visualizations.
    TRADITIONAL: Information and analytics are presented in compiled, static reports.
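The idea that several schemas can be applied to the same stored data by different stakeholder groups can be illustrated with a minimal schema-on-read sketch. Everything here is an assumption made for illustration: the raw JSON records, the field names, and the two groups ("finance" and "outreach") are hypothetical and do not come from any specific Data Lake product.

```python
import json

# Raw records are stored once, as-is (hypothetical sample data).
RAW = [
    '{"id": "17", "amount": "42.50", "region": "east", "note": "late fee"}',
    '{"id": "18", "amount": "10", "region": "west"}',
]

# Each stakeholder group declares only the fields and types it cares about.
FINANCE_SCHEMA = {"id": int, "amount": float}
OUTREACH_SCHEMA = {"id": str, "region": str, "note": str}

def read_with_schema(raw_lines, schema):
    """Apply a schema at read time; fields missing from a record become None."""
    rows = []
    for line in raw_lines:
        doc = json.loads(line)
        rows.append({name: cast(doc[name]) if name in doc else None
                     for name, cast in schema.items()})
    return rows

finance_view = read_with_schema(RAW, FINANCE_SCHEMA)
outreach_view = read_with_schema(RAW, OUTREACH_SCHEMA)
```

Both views are derived from the same stored records, so neither group has to agree on a single enterprise-wide schema before the data is loaded.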
Data Lake implementations using Big Data technologies like Hadoop represent a transformational shift in the data enterprise objectives for agencies. This shift allows existing legacy or traditional approaches to data utilization to advance dramatically.
SECTION 3:
DATA LAKE REQUIREMENTS
CREATING A SUCCESSFUL DATA LAKE
Scaling the data value proposition of the Data Lake starts by making data accessible and easy to use. The Data Lake’s data consumers will have diverse needs, so using a common data storage and access infrastructure alongside a fully featured Cloud Service Provider (e.g., Amazon Web Services or Microsoft Azure) provides the capabilities and flexibility needed to drive innovative uses of data and data services.
Using best-of-breed open-source cloud architectures for a Data Lake overcomes “vendor lock-in” challenges, eliminates linkage maintenance of stove-piped systems, increases ease of data use, expedites delivery, and ultimately reduces the risks and costs associated with achieving innovation.
A successful Data Lake implementation also allows data across the agency to be integrated and leveraged in a sophisticated solution. It begins with a modular, modern, cluster-based architecture (multiple interconnected servers) that is grounded in a flexible infrastructure platform.
A significant challenge when striving for innovative results is “vendor lock-in,” which is caused by proprietary commercial off-the-shelf (COTS) technologies that make it difficult to modify, scale, or transition to new data uses/services.
DATA LAKE GOVERNANCE
Without data lake governance, businesses could be left without meaningful business intelligence, or could even put the business itself in jeopardy.
SELECTING THE RIGHT PLATFORM
Agencies have successfully used AWS to support workloads and solutions with data from Controlled Unclassified Information (CUI) to Top Secret classifications.
Amazon Web Services (AWS)
• AWS Elastic MapReduce (EMR), a managed Hadoop, Spark, and Presto solution
• EMR ingests data from a number of AWS services
• AWS also has real-time analytics, predictive analytics, and data dashboard and visualization capabilities
• AWS has been used to support government missions in health and human sciences, defense, intelligence, statistical, regulatory, and financial industries
Microsoft Azure
• Azure includes the managed Apache platform HDInsight (Hadoop, Spark, Storm, HBase)
• HDInsight includes a local Hadoop Distributed File System (HDFS), connected to the Data Lake
• Azure Data Lake Store can store data in its native format, without prior transformations
• Recently added Azure Data Lake Analytics, a serverless hyper-scale data storage and analytical platform
Google Cloud
• Fully managed Hadoop and Spark offering
• Provides a fully programmable framework for Java and Python
• Cloud Dataflow & Spark for pipeline execution
• Machine Learning as a fully managed platform for training and hosting
• Google offers a Cloud Machine Learning Engine to build models based on TensorFlow’s deep learning library
*Comparisons shown above based on August 2017 data
SECTION 4:
5 STEPS FOR ARCHITECTING A DATA LAKE
1. INGESTION & STORAGE
DATA INGESTION
To begin data ingestion, agencies must perform an analysis of the high-value data sources present in the enterprise. These data sources are typically relational and/or transactional and offer quick-win opportunities to establish the Data Lake as the center for a single source of truth. The processes used to obtain and capture data can be iterated upon, and open source tools can reduce the complexities of data ingestion configuration.
DATA STORAGE
A data pipeline is built from processors, each of which handles a specific extract, transform, and load (ETL) process on incoming data. For data that requires more advanced processing, native tools can help bridge the gap between data collection, data ETL (including applying governance policies and access control), and data storage. Data from relational sources can be stored using native technologies.
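As a rough illustration of a pipeline made of processors, the sketch below chains small functions that each handle one ETL step. It is a minimal stand-in, not the API of any particular ingestion tool: the processor names, the "id" field, the "hr_db" source label, and the in-memory list used as storage are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

Record = Dict[str, object]
# A processor takes one record and returns a transformed record,
# or None to drop the record from the pipeline.
Processor = Callable[[Record], Optional[Record]]

def normalize_keys(record: Record) -> Record:
    """Transform step: strip and lower-case field names."""
    return {str(k).strip().lower(): v for k, v in record.items()}

def drop_incomplete(record: Record) -> Optional[Record]:
    """Filter step: drop records lacking the (hypothetical) 'id' field."""
    return record if record.get("id") not in (None, "") else None

def tag_source(source: str) -> Processor:
    """Build a processor that records where each record came from."""
    def _tag(record: Record) -> Record:
        return {**record, "_source": source}
    return _tag

@dataclass
class Pipeline:
    processors: List[Processor]
    store: List[Record] = field(default_factory=list)  # stand-in for lake storage

    def ingest(self, records) -> int:
        """Run each record through the processors in order; store survivors."""
        loaded = 0
        for record in records:
            current: Optional[Record] = record
            for processor in self.processors:
                current = processor(current)
                if current is None:
                    break
            if current is not None:
                self.store.append(current)
                loaded += 1
        return loaded

pipeline = Pipeline([normalize_keys, drop_incomplete, tag_source("hr_db")])
count = pipeline.ingest([{" ID ": 1, "Name": "Ada"}, {"Name": "no id"}])
```

Because each processor is independent, steps such as applying governance policies or access tags could be added to the list without changing the others.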
PROPER DATA INGESTION IS CRUCIAL TO THE SUCCESS OF A DATA LAKE. Understanding the velocity, size, format, and frequency of the data being ingested, and how it will be analyzed, ensures the architecture properly accommodates the data.
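One way to make velocity, size, format, and frequency concrete is to record them per source and estimate the daily volume the architecture must absorb. The sketch below is only an illustration: the field names, the two example sources, and their numbers are hypothetical, and treating velocity as "records per batch" is one possible reading among several.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SourceProfile:
    name: str
    data_format: str        # format, e.g. "csv", "json", "log"
    avg_record_bytes: int   # size of a typical record
    records_per_batch: int  # velocity within one delivery
    batches_per_day: int    # frequency of deliveries

    def daily_bytes(self) -> int:
        """Estimated bytes ingested per day from this source."""
        return self.avg_record_bytes * self.records_per_batch * self.batches_per_day

def total_daily_gib(profiles) -> float:
    """Sum the estimates across sources and convert to GiB."""
    return sum(p.daily_bytes() for p in profiles) / 2**30

# Hypothetical sources used only to exercise the arithmetic.
profiles = [
    SourceProfile("case_db", "csv", 512, 100_000, 24),
    SourceProfile("web_logs", "log", 256, 1_000_000, 96),
]
estimate = total_daily_gib(profiles)
```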
2. DATA PROCESSING
The processing capabilities of a Data Lake enable innovative and creative questioning at speeds and scales not achievable in legacy data processing environments. Queries and workloads run across the Data Lake’s cluster of nodes rather than on a single server, which reduces the resources required from any one server and maximizes the Data Lake’s ability to deliver results in a timely, streamlined way.
The freedom and expressive ability of a Data Lake’s processing paradigms allow users to think beyond simply asking questions of single data sources (e.g., a query performed on a relational data store).
Newer technologies allow entire datasets across the Data Lake to be loaded into the memory of the cluster, further reducing the time to compute heavy workloads and delivering results up to 100 times faster. By lowering the barriers to accessing and extracting value from the agency’s data, the Data Lake’s processing paradigms advance the ability to gain new insights from the data.
From challenges as simple as the word count of a dataset, to as
complicated as processing streaming biometric information, no
workload is too small, too large, too simple, or too complex to be
performed inside the Data Lake.
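The word count mentioned above is a convenient way to show how a workload is split across workers and then combined. The following is a minimal single-machine sketch of that map-and-reduce pattern: threads stand in for cluster nodes, and the partitions and function names are hypothetical. A real Data Lake would run the same two steps on a distributed engine.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

def count_words(partition):
    """Map step: count the words in one partition (a list of text lines)."""
    counts = Counter()
    for line in partition:
        counts.update(line.lower().split())
    return counts

def merge(a, b):
    """Reduce step: combine two partial counts."""
    return a + b

def word_count(partitions, workers=4):
    """Count each partition independently, then merge the partial results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(count_words, partitions))
    return reduce(merge, partials, Counter())

# Hypothetical partitions, as if the dataset were spread over two nodes.
partitions = [
    ["the lake stores data", "data streams in"],
    ["users sample the data"],
]
totals = word_count(partitions)
```

Each partition is counted without reference to the others, which is what lets the work be spread across many servers instead of one.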
3. ROBUST DATA GOVERNANCE
Data Lakes offer a single source of truth for an agency.
Therefore, it’s imperative that the data is appropriately
secured and only accessed by authorized individuals. Data
accountability can be established by using a combination
of native tools to ensure that users are only authorized to
view and execute actions that are approved for their role.
This accountability also allows security and audit
specialists to easily evaluate the data configurations and
operations across the Data Lake.
In addition to restricting access, an important piece in the data and information access control strategy is implementing data governance, retention, and lineage policies. Introducing these types of policies at the point of ingestion to the Data Lake automates an otherwise tedious and complicated process.
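A minimal sketch of attaching access, retention, and lineage information at the point of ingestion might look like the following. The policy table, role names, source names, and retention periods are all hypothetical, and a production Data Lake would rely on its platform's native governance tools for this.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Dict, FrozenSet

@dataclass(frozen=True)
class Policy:
    allowed_roles: FrozenSet[str]  # who may read the data
    retention_days: int            # how long the data is kept

@dataclass(frozen=True)
class GovernedRecord:
    payload: Dict[str, object]
    source: str        # lineage: the system the record came from
    ingested_on: date
    policy: Policy

# Hypothetical per-source policies, looked up once at ingestion.
POLICIES = {
    "hr_db": Policy(frozenset({"hr_analyst", "auditor"}), retention_days=365),
    "web_logs": Policy(frozenset({"ops", "auditor"}), retention_days=90),
}

def ingest(payload, source, today):
    """Tag the record with its source and policy as it enters the lake."""
    return GovernedRecord(payload, source, today, POLICIES[source])

def can_read(record, role):
    """Authorization check applied on read."""
    return role in record.policy.allowed_roles

def is_expired(record, today):
    """Retention check: True once the retention window has passed."""
    return today > record.ingested_on + timedelta(days=record.policy.retention_days)

rec = ingest({"employee_id": 7}, "hr_db", date(2018, 1, 1))
```

Because the policy travels with the record from the moment it is ingested, later reads and retention sweeps need no separate manual bookkeeping.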
Conducting stakeholder interviews to gain an
understanding of target high-value data systems enables a
holistic understanding of the taxonomies present in the
enterprise, and establishes the data governance and
access needs of the Data Lake.
Governance combines quality, management, policy management,
business process management, and risk management to ensure data
is formally and properly managed.
4. DATA RETRIEVAL & VISUALIZATION
One of the most important components of a Data Lake is
the ability to retrieve, analyze, visualize, and share insights
derived from data. Communicating data visually is directly
in line with the key pillars of a successful Data Lake.
Legacy COTS reporting tools are not designed to provide
the creative, captivating, and accessible analytics and
insight desired by users.
This means that the Data Lake’s tools must support the
dynamic challenge of enabling users to easily prepare
visually compelling data stories. As data-related problems
grow in size and complexity, traditional reverse-engineered
analysis methods that require pre-formulated hypotheses
and data source/schema decisions become more
expensive, less accurate, and too rigid for analysts to use
to make timely decisions.
Custom data visualization tools are well suited to providing an agency with a platform to deliver visual reporting based on public data, which can be delivered right to a user’s email via built-in automation features.
Today’s data user is accustomed to interaction with apps and data on
their personal devices via sophisticated user experiences and
compelling visual narratives.
5. ADVANCED ANALYTICS
Traditionally, data science projects were incredibly costly
due to the amount of resources needed to perform the
analytical processing required by certain algorithms and
processes. These barriers made the field of data science
difficult to access, because a successful project was too
expensive in both time and costs. However, with a Data
Lake, the ability to process data at huge scales is now
more readily available for data science applications.
An agency can exploit the capabilities found in the Data Lake by using its cluster-based data processing paradigms. Advanced analytical techniques commonly found in data science applications can then be applied. These techniques include machine learning, natural language processing, image processing, data mining, predictive analytics, statistical analytics, and more.
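As a very small example of the predictive analytics mentioned above, the sketch below fits a straight line to past values by ordinary least squares and extrapolates one step ahead. The monthly request counts are made-up numbers used only for illustration, and real projects would normally use a statistics or machine learning library on far larger datasets.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def predict(model, x):
    """Evaluate the fitted line at x."""
    slope, intercept = model
    return slope * x + intercept

# Hypothetical history: service requests received in months 1 to 5.
months = [1, 2, 3, 4, 5]
requests = [110, 130, 150, 170, 190]

model = fit_line(months, requests)
forecast = predict(model, 6)  # projected requests for month 6
```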
OVERVIEW OF A DATA LAKE’S CAPABILITIES
Successfully implementing a Data Lake environment requires an advanced understanding of the analytical insights the platform can provide through its mixed ecosystem of cutting-edge open-source technologies and best-of-breed commercial software. Identifying the best approach for developing and implementing the components, and the end goal of the insights to be derived from a Data Lake, is critical for architecting a successful environment.
Incorporating best practices for analyzing, interpreting, and understanding data science-generated results to support data-driven decision making also helps ensure success. Best practices, coupled with teams that combine skillsets in mathematics, computer science, and domain expertise to solve complex data challenges, allow agencies to maximize data discovery, data-driven decision making, and return on analytics innovation. All of this is built on a foundation of standardized metadata, firm access protocols, intelligent discovery mechanisms, and a flexible data governance process to reduce data silos.
SECTION 5:
MAXIMIZING THE VALUE OF A DATA LAKE
DATA REVOLVES AROUND CITIZENS
A Data Lake is only as powerful as the insights an agency is able to derive from its contents. Those insights are only as valuable as the agency’s ability to power change via them.
This end state requires the ability for stakeholders and users to derive insights leveraging a Citizen Engagement Model (CEM) integration.
Using a component-driven design and development approach that leverages best practices from Human-Centered Design and Agile principles will help agencies increase the usability, searchability, findability, and extensibility of their data.
ENHANCING CITIZEN EXPERIENCE
By integrating the citizen-centric data lake with the CEM, agencies are able to gather new, valuable insights from previously siloed datasets. Those insights:
• Enable quantitative assessment of changing customer needs and technological innovations
• Identify metrics, KPIs, and requirements needed to build CEM dashboards
• Identify additional data sources required
• Improve relevancy of search index and recommendations related to structured and unstructured searches
• Provide support to create, maintain, and improve loading process
• Support configuration and maintenance of the current data environments
ASSESSING READINESS
Properly architecting a Data Lake will provide agencies with numerous benefits, including low-cost storage, custom configurations, unified enterprise data, and the ability to securely scale, all of which provide agencies with a unique competitive advantage.
The delivery of the Data Lake does not end with architecting, deploying, integrating, and configuring the solution. The Data Lake is built on the concept of removing barriers to innovating with data, but without proper education delivered by expert practitioners in the fields of Data Science, Big Data, and Cloud Computing, the opportunities the Data Lake enables cannot be fully realized.
Having a team of highly skilled experts supporting a Data Lake is essential to the realization of a fully functioning Data Lake. Our team of full-service data scientists, with specializations across Big Data, large-scale data platforms, advanced analytics, mathematical modeling, and computer science, is uniquely qualified to provide the level of educational care our customers require.
We possess deep technical expertise in open source development technologies and containerization methods that bring efficiencies to development efforts, and we have a deep bench of software developer consultants who bring the greatest level of technical acumen and availability. Our team is not only an avid user and implementer of open source software, but has also given back to the open source community as active contributors to the Apache Accumulo, Hadoop, NiFi, and Mahout projects.
ABOUT METROSTAR SYSTEMS
MetroStar Systems has been a trusted partner, delivering leading-edge technology solutions to federal and defense agencies since 1999. MetroStar’s unique blend of cross-functional experts across three practice areas (Cybersecurity, Digital, and Enterprise IT) enables the successful delivery of transformative solutions. Learn more about our work implementing data lakes for federal agencies: https://www.metrostarsystems.com
TO LEARN MORE ABOUT METROSTAR SYSTEMS:
Contact: Debbie Peterson
1856 Old Reston Avenue, Suite 100
Reston, VA 20190
703.481.9581
dpeterson@metrostarsystems.com
www.metrostarsystems.com
© Copyright 2018 MetroStar Systems, Inc., This document is current as of the initial date of publication and may be changed by MetroStar Systems at any time. The
performance data and examples cited are presented for illustrative purposes only. Actual performance results may vary depending on specific configurations and
operating conditions. THE INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS” WITHOUT ANY WARRANTY, EXPRESS OR IMPLIED, INCLUDING WITHOUT
ANY WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND ANY WARRANTY OR CONDITION OF NON-INFRINGEMENT.

 
Enterprise Data Lake
Enterprise Data LakeEnterprise Data Lake
Enterprise Data Lake
 
Enterprise Data Lake - Scalable Digital
Enterprise Data Lake - Scalable DigitalEnterprise Data Lake - Scalable Digital
Enterprise Data Lake - Scalable Digital
 
LinkedInSaxoBankDataWorkbench
LinkedInSaxoBankDataWorkbenchLinkedInSaxoBankDataWorkbench
LinkedInSaxoBankDataWorkbench
 
Benefits of a data lake
Benefits of a data lake Benefits of a data lake
Benefits of a data lake
 
WP_Impetus_2016_Guide_to_Modernize_Your_Enterprise_Data_Warehouse_JRoberts
WP_Impetus_2016_Guide_to_Modernize_Your_Enterprise_Data_Warehouse_JRobertsWP_Impetus_2016_Guide_to_Modernize_Your_Enterprise_Data_Warehouse_JRoberts
WP_Impetus_2016_Guide_to_Modernize_Your_Enterprise_Data_Warehouse_JRoberts
 
IBM Cloud pak for data brochure
IBM Cloud pak for data   brochureIBM Cloud pak for data   brochure
IBM Cloud pak for data brochure
 
Crafting highly scalable and performant Modern Data Platforms
Crafting highly scalable and performant Modern Data PlatformsCrafting highly scalable and performant Modern Data Platforms
Crafting highly scalable and performant Modern Data Platforms
 
Data Modernization_Harinath Susairaj.pptx
Data Modernization_Harinath Susairaj.pptxData Modernization_Harinath Susairaj.pptx
Data Modernization_Harinath Susairaj.pptx
 
Data Ninja Webinar Series: Realizing the Promise of Data Lakes
Data Ninja Webinar Series: Realizing the Promise of Data LakesData Ninja Webinar Series: Realizing the Promise of Data Lakes
Data Ninja Webinar Series: Realizing the Promise of Data Lakes
 
Best Practices For Building and Operating A Managed Data Lake - StampedeCon 2016
Best Practices For Building and Operating A Managed Data Lake - StampedeCon 2016Best Practices For Building and Operating A Managed Data Lake - StampedeCon 2016
Best Practices For Building and Operating A Managed Data Lake - StampedeCon 2016
 
Analytics and Lakehouse Integration Options for Oracle Applications
Analytics and Lakehouse Integration Options for Oracle ApplicationsAnalytics and Lakehouse Integration Options for Oracle Applications
Analytics and Lakehouse Integration Options for Oracle Applications
 
ADV Slides: The Evolution of the Data Platform and What It Means to Enterpris...
ADV Slides: The Evolution of the Data Platform and What It Means to Enterpris...ADV Slides: The Evolution of the Data Platform and What It Means to Enterpris...
ADV Slides: The Evolution of the Data Platform and What It Means to Enterpris...
 
Fast Data Strategy Houston Roadshow Presentation
Fast Data Strategy Houston Roadshow PresentationFast Data Strategy Houston Roadshow Presentation
Fast Data Strategy Houston Roadshow Presentation
 
Lecture4 big data technology foundations
Lecture4 big data technology foundationsLecture4 big data technology foundations
Lecture4 big data technology foundations
 
Big data analysis concepts and references
Big data analysis concepts and referencesBig data analysis concepts and references
Big data analysis concepts and references
 
Data Virtualization: An Introduction
Data Virtualization: An IntroductionData Virtualization: An Introduction
Data Virtualization: An Introduction
 
Accenture-Cloud-Data-Migration-POV-Final.pdf
Accenture-Cloud-Data-Migration-POV-Final.pdfAccenture-Cloud-Data-Migration-POV-Final.pdf
Accenture-Cloud-Data-Migration-POV-Final.pdf
 
Big Data Session 1.pptx
Big Data Session 1.pptxBig Data Session 1.pptx
Big Data Session 1.pptx
 
Application Of A New Database Management System
Application Of A New Database Management SystemApplication Of A New Database Management System
Application Of A New Database Management System
 

Recently uploaded

Types of Weaving loom machine & it's technology
Types of Weaving loom machine & it's technologyTypes of Weaving loom machine & it's technology
Types of Weaving loom machine & it's technology
ldtexsolbl
 
Mastering Board Best Practices: Essential Skills for Effective Non-profit Lea...
Mastering Board Best Practices: Essential Skills for Effective Non-profit Lea...Mastering Board Best Practices: Essential Skills for Effective Non-profit Lea...
Mastering Board Best Practices: Essential Skills for Effective Non-profit Lea...
OnBoard
 
UX Webinar Series: Drive Revenue and Decrease Costs with Passkeys for Consume...
UX Webinar Series: Drive Revenue and Decrease Costs with Passkeys for Consume...UX Webinar Series: Drive Revenue and Decrease Costs with Passkeys for Consume...
UX Webinar Series: Drive Revenue and Decrease Costs with Passkeys for Consume...
FIDO Alliance
 
NVIDIA at Breakthrough Discuss for Space Exploration
NVIDIA at Breakthrough Discuss for Space ExplorationNVIDIA at Breakthrough Discuss for Space Exploration
NVIDIA at Breakthrough Discuss for Space Exploration
Alison B. Lowndes
 
Girls call Kolkata 👀 XXXXXXXXXXX 👀 Rs.9.5 K Cash Payment With Room Delivery
Girls call Kolkata 👀 XXXXXXXXXXX 👀 Rs.9.5 K Cash Payment With Room Delivery Girls call Kolkata 👀 XXXXXXXXXXX 👀 Rs.9.5 K Cash Payment With Room Delivery
Girls call Kolkata 👀 XXXXXXXXXXX 👀 Rs.9.5 K Cash Payment With Room Delivery
sunilverma7884
 
Discovery Series - Zero to Hero - Task Mining Session 1
Discovery Series - Zero to Hero - Task Mining Session 1Discovery Series - Zero to Hero - Task Mining Session 1
Discovery Series - Zero to Hero - Task Mining Session 1
DianaGray10
 
BLOCKCHAIN TECHNOLOGY - Advantages and Disadvantages
BLOCKCHAIN TECHNOLOGY - Advantages and DisadvantagesBLOCKCHAIN TECHNOLOGY - Advantages and Disadvantages
BLOCKCHAIN TECHNOLOGY - Advantages and Disadvantages
SAI KAILASH R
 
Communications Mining Series - Zero to Hero - Session 3
Communications Mining Series - Zero to Hero - Session 3Communications Mining Series - Zero to Hero - Session 3
Communications Mining Series - Zero to Hero - Session 3
DianaGray10
 
MAKE MONEY ONLINE Unlock Your Income Potential Today.pptx
MAKE MONEY ONLINE Unlock Your Income Potential Today.pptxMAKE MONEY ONLINE Unlock Your Income Potential Today.pptx
MAKE MONEY ONLINE Unlock Your Income Potential Today.pptx
janagijoythi
 
Improving Learning Content Efficiency with Reusable Learning Content
Improving Learning Content Efficiency with Reusable Learning ContentImproving Learning Content Efficiency with Reusable Learning Content
Improving Learning Content Efficiency with Reusable Learning Content
Enterprise Knowledge
 
The Path to General-Purpose Robots - Coatue
The Path to General-Purpose Robots - CoatueThe Path to General-Purpose Robots - Coatue
The Path to General-Purpose Robots - Coatue
Razin Mustafiz
 
Tailored CRM Software Development for Enhanced Customer Insights
Tailored CRM Software Development for Enhanced Customer InsightsTailored CRM Software Development for Enhanced Customer Insights
Tailored CRM Software Development for Enhanced Customer Insights
SynapseIndia
 
Google I/O Extended Harare Merged Slides
Google I/O Extended Harare Merged SlidesGoogle I/O Extended Harare Merged Slides
Google I/O Extended Harare Merged Slides
Google Developer Group - Harare
 
Redefining Cybersecurity with AI Capabilities
Redefining Cybersecurity with AI CapabilitiesRedefining Cybersecurity with AI Capabilities
Redefining Cybersecurity with AI Capabilities
Priyanka Aash
 
What's New in Teams Calling, Meetings, Devices June 2024
What's New in Teams Calling, Meetings, Devices June 2024What's New in Teams Calling, Meetings, Devices June 2024
What's New in Teams Calling, Meetings, Devices June 2024
Stephanie Beckett
 
It's your unstructured data: How to get your GenAI app to production (and spe...
It's your unstructured data: How to get your GenAI app to production (and spe...It's your unstructured data: How to get your GenAI app to production (and spe...
It's your unstructured data: How to get your GenAI app to production (and spe...
Zilliz
 
Sonkoloniya documentation - ONEprojukti.pdf
Sonkoloniya documentation - ONEprojukti.pdfSonkoloniya documentation - ONEprojukti.pdf
Sonkoloniya documentation - ONEprojukti.pdf
SubhamMandal40
 
Generative AI Reasoning Tech Talk - July 2024
Generative AI Reasoning Tech Talk - July 2024Generative AI Reasoning Tech Talk - July 2024
Generative AI Reasoning Tech Talk - July 2024
siddu769252
 
Vulnerability Management: A Comprehensive Overview
Vulnerability Management: A Comprehensive OverviewVulnerability Management: A Comprehensive Overview
Vulnerability Management: A Comprehensive Overview
Steven Carlson
 
Perth MuleSoft Meetup July 2024
Perth MuleSoft Meetup July 2024Perth MuleSoft Meetup July 2024
Perth MuleSoft Meetup July 2024
Michael Price
 

Recently uploaded (20)

Types of Weaving loom machine & it's technology
Types of Weaving loom machine & it's technologyTypes of Weaving loom machine & it's technology
Types of Weaving loom machine & it's technology
 
Mastering Board Best Practices: Essential Skills for Effective Non-profit Lea...
Mastering Board Best Practices: Essential Skills for Effective Non-profit Lea...Mastering Board Best Practices: Essential Skills for Effective Non-profit Lea...
Mastering Board Best Practices: Essential Skills for Effective Non-profit Lea...
 
UX Webinar Series: Drive Revenue and Decrease Costs with Passkeys for Consume...
UX Webinar Series: Drive Revenue and Decrease Costs with Passkeys for Consume...UX Webinar Series: Drive Revenue and Decrease Costs with Passkeys for Consume...
UX Webinar Series: Drive Revenue and Decrease Costs with Passkeys for Consume...
 
NVIDIA at Breakthrough Discuss for Space Exploration
NVIDIA at Breakthrough Discuss for Space ExplorationNVIDIA at Breakthrough Discuss for Space Exploration
NVIDIA at Breakthrough Discuss for Space Exploration
 
Girls call Kolkata 👀 XXXXXXXXXXX 👀 Rs.9.5 K Cash Payment With Room Delivery
Girls call Kolkata 👀 XXXXXXXXXXX 👀 Rs.9.5 K Cash Payment With Room Delivery Girls call Kolkata 👀 XXXXXXXXXXX 👀 Rs.9.5 K Cash Payment With Room Delivery
Girls call Kolkata 👀 XXXXXXXXXXX 👀 Rs.9.5 K Cash Payment With Room Delivery
 
Discovery Series - Zero to Hero - Task Mining Session 1
Discovery Series - Zero to Hero - Task Mining Session 1Discovery Series - Zero to Hero - Task Mining Session 1
Discovery Series - Zero to Hero - Task Mining Session 1
 
BLOCKCHAIN TECHNOLOGY - Advantages and Disadvantages
BLOCKCHAIN TECHNOLOGY - Advantages and DisadvantagesBLOCKCHAIN TECHNOLOGY - Advantages and Disadvantages
BLOCKCHAIN TECHNOLOGY - Advantages and Disadvantages
 
Communications Mining Series - Zero to Hero - Session 3
Communications Mining Series - Zero to Hero - Session 3Communications Mining Series - Zero to Hero - Session 3
Communications Mining Series - Zero to Hero - Session 3
 
MAKE MONEY ONLINE Unlock Your Income Potential Today.pptx
MAKE MONEY ONLINE Unlock Your Income Potential Today.pptxMAKE MONEY ONLINE Unlock Your Income Potential Today.pptx
MAKE MONEY ONLINE Unlock Your Income Potential Today.pptx
 
Improving Learning Content Efficiency with Reusable Learning Content
Improving Learning Content Efficiency with Reusable Learning ContentImproving Learning Content Efficiency with Reusable Learning Content
Improving Learning Content Efficiency with Reusable Learning Content
 
The Path to General-Purpose Robots - Coatue
The Path to General-Purpose Robots - CoatueThe Path to General-Purpose Robots - Coatue
The Path to General-Purpose Robots - Coatue
 
Tailored CRM Software Development for Enhanced Customer Insights
Tailored CRM Software Development for Enhanced Customer InsightsTailored CRM Software Development for Enhanced Customer Insights
Tailored CRM Software Development for Enhanced Customer Insights
 
Google I/O Extended Harare Merged Slides
Google I/O Extended Harare Merged SlidesGoogle I/O Extended Harare Merged Slides
Google I/O Extended Harare Merged Slides
 
Redefining Cybersecurity with AI Capabilities
Redefining Cybersecurity with AI CapabilitiesRedefining Cybersecurity with AI Capabilities
Redefining Cybersecurity with AI Capabilities
 
What's New in Teams Calling, Meetings, Devices June 2024
What's New in Teams Calling, Meetings, Devices June 2024What's New in Teams Calling, Meetings, Devices June 2024
What's New in Teams Calling, Meetings, Devices June 2024
 
It's your unstructured data: How to get your GenAI app to production (and spe...
It's your unstructured data: How to get your GenAI app to production (and spe...It's your unstructured data: How to get your GenAI app to production (and spe...
It's your unstructured data: How to get your GenAI app to production (and spe...
 
Sonkoloniya documentation - ONEprojukti.pdf
Sonkoloniya documentation - ONEprojukti.pdfSonkoloniya documentation - ONEprojukti.pdf
Sonkoloniya documentation - ONEprojukti.pdf
 
Generative AI Reasoning Tech Talk - July 2024
Generative AI Reasoning Tech Talk - July 2024Generative AI Reasoning Tech Talk - July 2024
Generative AI Reasoning Tech Talk - July 2024
 
Vulnerability Management: A Comprehensive Overview
Vulnerability Management: A Comprehensive OverviewVulnerability Management: A Comprehensive Overview
Vulnerability Management: A Comprehensive Overview
 
Perth MuleSoft Meetup July 2024
Perth MuleSoft Meetup July 2024Perth MuleSoft Meetup July 2024
Perth MuleSoft Meetup July 2024
 

5 Steps for Architecting a Data Lake

  • 1. 5 STEPS FOR ARCHITECTING A DATA LAKE: How to maximize intelligence by unifying enterprise data. © 2018 MetroStar Systems, Inc. - All Rights Reserved
  • 2. TABLE OF CONTENTS
      • SECTION 1: INTRODUCTION: Data Growth Challenges
      • SECTION 2: WHAT IS A DATA LAKE?: How Does a Data Lake Work?; Data Lake vs Traditional Approach
      • SECTION 3: DATA LAKE REQUIREMENTS: Creating a Successful Data Lake; Data Lake Governance; Selecting the Right Platform
      • SECTION 4: 5 STEPS FOR ARCHITECTING A DATA LAKE: 1. Ingestion & Storage; 2. Data Processing; 3. Robust Data Governance; 4. Data Retrieval and Visualization; 5. Advanced Analytics; Overview of a Data Lake’s Capabilities
      • SECTION 5: MAXIMIZING THE VALUE OF A DATA LAKE: Data Revolves Around Citizens; Enhancing Citizen Experience
      • ASSESSING READINESS
  • 3. SECTION 1: INTRODUCTION
  • 4. DATA GROWTH CHALLENGES. AS AN AGENCY GROWS, SO DOES ITS DATA. Enterprises that do not employ Data Lake platforms can find themselves outpaced by the rate of their agency’s data growth. Data growth challenges include:
      • High overhead costs due to inflexible architecture and legacy technology maintenance
      • Antiquated data environments that suffer from poor master data management practices
      • Low data integrity due to the lack of a single source of truth for the data
      • Inability to provide internal users, analysts, developers, and management with the tools needed to perform their respective roles at the high caliber of quality expected in today’s workplace
    Data is no longer limited to structured, relational, or transactional records. It now includes semi-structured and unstructured data, operational logs, social media, free text, and more. The ability to ingest data of all varieties is imperative to gaining a holistic understanding of the digital ecosystem. Agencies can combine cutting-edge technologies with wide-ranging, high-integrity data sources to derive powerful insights into their operational and theoretical questions. Coupling the robust technologies of a Data Lake with the flexible, cost-effective capabilities of a Cloud Service Provider (CSP) such as Amazon Web Services (AWS) or Microsoft Azure makes the Data Lake a powerful asset for agencies large and small. Source: http://infosysblogs.com/brandededge/2013/04/20130419infographic.html
  • 5. SECTION 2: WHAT IS A DATA LAKE?
  • 6. HOW DOES A DATA LAKE WORK? A Data Lake is a natural maturation of data migrating to a single environment. The Data Lake provides capabilities seldom seen in IT enterprises that employ disparate data stores and databases. “A data lake is like a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples.” – James Dixon, CTO, Pentaho
  • 7. DATA LAKE vs TRADITIONAL APPROACH
      • Data Storage. Data Lake: structured, semi-structured, or unstructured data can be stored at low cost, either with a schema (e.g., relational) or schema-less. Traditional: data is stored in vertically scaling relational database management systems (RDBMS) at high cost.
      • Advanced Analytics. Data Lake: analytics can be run on any and all data sets in real time (e.g., in-memory machine learning algorithms) without requiring upfront manual processing or preparation. Traditional: data typically has to be manually prepared and integrated from multiple sources, which can be a significant barrier to generating rapid insights.
      • Enterprise Data Taxonomy. Data Lake: multiple taxonomies, schemas, and standards can exist in a single data environment while being applied by different data stakeholder groups. Traditional: agencies have struggled to create a single taxonomy or schema to represent the enterprise data model.
      • User Access Control. Data Lake: data is tagged at ingestion (and automatically analyzed on read) with the appropriate authorization rules; authentication can be controlled through single sign-on (SSO) capabilities. Traditional: data authentication and authorization are specified using manually controlled and disparate tools (e.g., Access Control Lists).
      • Business Intelligence. Data Lake: information and analytics are conveyed using automated, feature-rich dashboards and visualizations. Traditional: information and analytics are presented in compiled, static reports.
    Data Lake implementations using Big Data technologies like Hadoop represent a transformational shift in the data enterprise objectives for agencies. This shift allows existing legacy or traditional approaches to data utilization to advance dramatically.
  • 8. SECTION 3: DATA LAKE REQUIREMENTS
  • 9. CREATING A SUCCESSFUL DATA LAKE. Scaling the value proposition of the Data Lake starts by making data accessible and easy to use. The Data Lake’s data consumers will have diverse needs, so using a common data storage and access infrastructure alongside a fully featured Cloud Service Provider (e.g., Amazon Web Services or Microsoft Azure) provides the capabilities and flexibility needed to drive innovative uses of data and data services. A significant challenge when striving for innovative results is “vendor lock-in,” which is caused by proprietary commercial off-the-shelf (COTS) technologies that make it difficult to modify, scale, or transition to new data uses and services. Building the Data Lake on best-of-breed open-source cloud architectures overcomes vendor lock-in, eliminates the maintenance of linkages between stove-piped systems, increases ease of data use, expedites delivery, and ultimately reduces the risks and costs associated with achieving innovation. A successful Data Lake implementation also allows data across the agency to be integrated and leveraged in a sophisticated solution. It begins with a modular, modern, cluster-based architecture (multiple interconnected servers) that is grounded in a flexible infrastructure platform.
  • 10. DATA LAKE GOVERNANCE. Without data lake governance, businesses could be left without meaningful business intelligence, or could even jeopardize the business.
  • 11. SELECTING THE RIGHT PLATFORM
      • Amazon Web Services (AWS)
          • Agencies have successfully used AWS to support workloads and solutions with data ranging from Controlled Unclassified Information (CUI) to Top Secret classifications
          • AWS Elastic MapReduce (EMR), a managed Hadoop, Spark, and Presto solution
          • EMR ingests with a number of AWS services
          • AWS also has real-time analytics, predictive analytics, and data dashboard and visualization capabilities
          • AWS has been used to support government missions in health and human sciences, defense, intelligence, statistical, regulatory, and financial industries
      • Microsoft Azure
          • Azure includes the managed Apache platform HDInsight (Hadoop, Spark, Storm, HBase)
          • HDInsight includes a local Hadoop Distributed File System (HDFS), connected to the Data Lake
          • Azure Data Lake Store can store data in its native format, without prior transformations
          • Recently added Azure Data Lake Analytics, a serverless hyper-scale data storage and analytical platform
      • Google Cloud
          • Fully managed Hadoop and Spark offering
          • Provides a fully programmable framework for Java and Python
          • Cloud Dataflow & Spark for pipeline execution
          • Machine Learning as a fully managed platform for training and hosting
          • Google offers a Cloud Machine Learning Engine to build models based on TensorFlow’s deep learning library
    *Comparisons shown above are based on August 2017 data
  • 12. SECTION 4: 5 STEPS FOR ARCHITECTING A DATA LAKE
  • 13. 1. INGESTION & STORAGE. PROPER DATA INGESTION IS CRUCIAL TO THE SUCCESS OF A DATA LAKE. Understanding the velocity, size, format, and frequency of the data being ingested, and how it will be analyzed, ensures that the architecture properly accommodates the data. DATA INGESTION: To begin data ingestion, agencies must analyze the high-value data sources present in the enterprise. These data sources are typically relational and/or transactional and offer quick-win opportunities to establish the Data Lake as the single source of truth. The processes used to obtain and capture data can be iterated upon, and open-source tools can reduce the complexity of configuring data ingestion. DATA STORAGE: A data pipeline is built from components called processors, each of which handles a specific extract, transform, and load (ETL) step on incoming data. For data that requires more advanced processing, native tools can help bridge the gap between data collection, data ETL (including applying governance policies and access control), and data storage. Data from relational sources can be stored using native technologies.
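The processor-based pipeline described on slide 13 could be sketched roughly as follows. This is only an illustration in plain Python, not the tooling the deck has in mind: the names `normalize_keys`, `tag_source`, and `run_pipeline`, and the `_source` field, are hypothetical and do not come from the original slides.

```python
from typing import Callable, Iterable

# A record is a plain dict; a processor is any function that takes a record
# and returns a (possibly transformed) record. Both are illustrative choices.
Record = dict
Processor = Callable[[Record], Record]


def normalize_keys(record: Record) -> Record:
    """Transform step: trim and lower-case field names from the source system."""
    return {key.strip().lower(): value for key, value in record.items()}


def tag_source(source: str) -> Processor:
    """Build a processor that stamps each record with its system of origin."""
    def processor(record: Record) -> Record:
        return {**record, "_source": source}
    return processor


def run_pipeline(records: Iterable[Record], processors: list[Processor]) -> list[Record]:
    """Apply every processor, in order, to every incoming record."""
    stored = []
    for record in records:
        for processor in processors:
            record = processor(record)
        stored.append(record)  # stand-in for the "load" into lake storage
    return stored
```

For example, `run_pipeline([{" Name ": "a"}], [normalize_keys, tag_source("hr_db")])` yields one record with a normalized `name` field and a `_source` tag. Keeping each processor a small, single-purpose function is what lets the ingestion process be iterated upon, as the slide suggests.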
  • 14. 2. DATA PROCESSING. The processing capabilities of a Data Lake enable innovative and creative questioning at speeds and scales never before seen in legacy data processing environments. Queries and workloads run across the Data Lake’s cluster of nodes rather than on a single server, which reduces the resources required of any one server and maximizes the Data Lake’s ability to deliver results in a timely, streamlined way. The freedom and expressive power of a Data Lake’s processing paradigms allow users to think beyond simply asking questions of single data sources (e.g., a query performed on a relational data store). Newer technologies allow entire datasets across the Data Lake to be loaded into the memory of the cluster, further reducing the time to compute heavy workloads and delivering results up to 100 times faster. By lowering the complexity barriers to accessing and extracting value from the agency’s data, the Data Lake’s processing paradigms advance the ability to gain new insights from the data. From challenges as simple as the word count of a dataset to those as complicated as processing streaming biometric information, no workload is too small, too large, too simple, or too complex to be performed inside the Data Lake.
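The word-count workload mentioned on slide 14 is a convenient way to illustrate how work is split across a cluster and then combined. The sketch below is a single-process stand-in, assuming each inner list plays the role of the data held by one node; a real Data Lake would delegate this to a distributed engine. The function names `map_partition` and `word_count` are hypothetical.

```python
from collections import Counter
from functools import reduce


def map_partition(lines: list[str]) -> Counter:
    """Count words within one partition (the work a single node would do)."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def word_count(partitions: list[list[str]]) -> Counter:
    """Count each partition independently, then merge the partial results."""
    partials = [map_partition(partition) for partition in partitions]
    return reduce(lambda left, right: left + right, partials, Counter())
```

Because each partition is counted independently, the per-partition step could run on separate nodes in parallel, with only the small partial counts moved over the network for the final merge.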
  • 15. 3. ROBUST DATA GOVERNANCE. Data Lakes offer a single source of truth for an agency. Therefore, it’s imperative that the data is appropriately secured and accessed only by authorized individuals. Data accountability can be established by using a combination of native tools to ensure that users are authorized to view and execute only the actions approved for their role. This accountability also allows security and audit specialists to easily evaluate the data configurations and operations across the Data Lake. In addition to restricting access, an important piece of the data and information access control strategy is implementing data governance, retention, and lineage policies. Introducing these policies at the point of ingestion into the Data Lake automates an otherwise tedious and complicated process. Conducting stakeholder interviews about target high-value data systems provides a holistic understanding of the taxonomies present in the enterprise and establishes the data governance and access needs of the Data Lake. Governance combines quality, management, policy management, business process management, and risk management to ensure data is formally and properly managed.
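The idea of attaching authorization rules at the point of ingestion and enforcing them on read, as described on slide 15, might look something like the minimal sketch below. It assumes simple role names as the authorization rule; `TaggedRecord`, `ingest`, and `read` are hypothetical names for illustration and do not represent any particular governance tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaggedRecord:
    """A record stored together with the roles allowed to read it."""
    payload: dict
    allowed_roles: frozenset


def ingest(payload: dict, allowed_roles: set[str]) -> TaggedRecord:
    """Tag the record with its authorization rule as it enters the lake."""
    return TaggedRecord(payload=payload, allowed_roles=frozenset(allowed_roles))


def read(records: list[TaggedRecord], role: str) -> list[dict]:
    """Return only the payloads that the caller's role is authorized to view."""
    return [record.payload for record in records if role in record.allowed_roles]
```

Because the rule travels with the record from ingestion onward, every read path can apply the same check without consulting a separate, manually maintained access list.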
  • 16. 4. DATA RETRIEVAL & VISUALIZATION. One of the most important components of a Data Lake is the ability to retrieve, analyze, visualize, and share insights derived from data. Communicating data visually is directly in line with the key pillars of a successful Data Lake. Today’s data user is accustomed to interacting with apps and data on personal devices through sophisticated user experiences and compelling visual narratives. Legacy COTS reporting tools are not designed to provide the creative, captivating, and accessible analytics and insight that users desire, so the Data Lake’s tools must enable users to easily prepare visually compelling data stories. As data-related problems grow in size and complexity, traditional reverse-engineered analysis methods that require pre-formulated hypotheses and data source/schema decisions become more expensive, less accurate, and too rigid for analysts to use to make timely decisions. Custom data visualization tools are well suited to providing an agency with a platform for visual reporting based on public data, which can be delivered right to a user’s email via built-in automation features.
  • 17. 5. ADVANCED ANALYTICS. Traditionally, data science projects were incredibly costly due to the amount of resources needed to perform the analytical processing required by certain algorithms and processes. These barriers made the field of data science difficult to access, because a successful project was too expensive in both time and cost. With a Data Lake, however, the ability to process data at huge scales is more readily available for data science applications. An agency can exploit the capabilities of the Data Lake by using its cluster-based data processing paradigms. Advanced analytical techniques commonly found in data science applications can then be applied. These techniques include machine learning, natural language processing, image processing, data mining, predictive analytics, statistical analytics, and more.
  • 18. OVERVIEW OF A DATA LAKE’S CAPABILITIES. Successfully implementing a Data Lake environment requires an advanced understanding of the analytical insights the holistic platform makes possible through its mixed ecosystem of cutting-edge open-source technologies and best-of-breed commercial software. Identifying the best approach for developing and implementing the components, and the end goal of the insights to be derived from a Data Lake, is critical for architecting a successful environment. Incorporating best practices for analyzing, interpreting, and understanding data science-generated results to support data-driven decision making also helps ensure success. Best practices, coupled with teams that combine skills in mathematics, computer science, and domain expertise to solve complex data challenges, allow agencies to maximize data discovery, data-driven decision making, and return on analytics innovation. All of this is built on a foundation of standardized metadata, firm access protocols, intelligent discovery mechanisms, and a flexible data governance process that reduces data silos.
  • 19. SECTION 5: MAXIMIZING THE VALUE OF A DATA LAKE
  • 20. DATA REVOLVES AROUND CITIZENS. A Data Lake is only as powerful as the insights an agency is able to derive from its contents. Those insights are only as valuable as the agency’s ability to drive change with them. This end state requires that stakeholders and users be able to derive insights through integration with a Citizen Engagement Model (CEM). Using a component-driven design and development approach that applies best practices from Human-Centered Design and Agile principles will help agencies increase the usability, searchability, findability, and extensibility of their data.
  • 21. ENHANCING CITIZEN EXPERIENCE By integrating the citizen-centric Data Lake with the CEM, agencies are able to gather new, valuable insights from previously siloed datasets. Those insights:
      ◦ Enable quantitative assessment of changing customer needs and technological innovations
      ◦ Identify the metrics, KPIs, and requirements needed to build CEM dashboards
      ◦ Identify additional data sources required
      ◦ Improve the relevancy of the search index and of recommendations for structured and unstructured searches
      ◦ Provide support to create, maintain, and improve the loading process
      ◦ Support configuration and maintenance of the current data environments
    Properly architecting a Data Lake will provide agencies with numerous benefits, including low-cost storage, custom configurations, unified enterprise data, and the ability to scale securely, all of which give agencies a unique competitive advantage.
  • 22. ASSESSING READINESS The delivery of the Data Lake does not end with architecting, deploying, integrating, and configuring the solution. The Data Lake is built on the concept of removing barriers to innovating with data, but without proper education delivered by expert practitioners in Data Science, Big Data, and Cloud Computing, the opportunities the Data Lake enables cannot be fully realized. A team of highly skilled experts supporting the Data Lake is essential to making it fully functional. Our team of full-service data scientists, with specializations across Big Data, large-scale data platforms, advanced analytics, mathematical modeling, and computer science, is uniquely qualified to provide the level of educational care our customers require. We possess deep technical expertise in open source development technologies and containerization methods that bring efficiencies to development efforts, and we have a deep bench of software developer consultants who bring a high level of technical acumen and availability. Our team is not only an avid user and implementer of open source software, but has also given back to the open source community as active contributors to the Apache Accumulo, Hadoop, NiFi, and Mahout projects. ABOUT METROSTAR SYSTEMS MetroStar Systems has been a trusted partner, delivering leading-edge technology solutions to federal and defense agencies since 1999. MetroStar’s unique blend of cross-functional experts across three practice areas (Cybersecurity, Digital, and Enterprise IT) enables the successful delivery of transformative solutions. Learn more about our work implementing data lakes for federal agencies: https://www.metrostarsystems.com
  • 23. TO LEARN MORE ABOUT METROSTAR SYSTEMS: Contact: Debbie Peterson, 1856 Old Reston Avenue, Suite 100, Reston, VA 20190, 703.481.9581, dpeterson@metrostarsystems.com, www.metrostarsystems.com © Copyright 2018 MetroStar Systems, Inc. This document is current as of the initial date of publication and may be changed by MetroStar Systems at any time. The performance data and examples cited are presented for illustrative purposes only. Actual performance results may vary depending on specific configurations and operating conditions. THE INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS” WITHOUT ANY WARRANTY, EXPRESS OR IMPLIED, INCLUDING WITHOUT ANY WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND ANY WARRANTY OR CONDITION OF NON-INFRINGEMENT.