Best Practices to Build a Data Lake
https://fibonalabs.com/
The need for big data is inevitable. Data is the new currency, and it is estimated
that 90% of the data in the world today has been created in the last two years
alone, with 2.5 quintillion bytes of data created every day. With this much data
being created, companies face growing challenges in making the best possible
use of it. Building a Data Lake is one way to address them.
A Data Lake is a vast pool of raw data, both structured and unstructured, that
can be processed and analyzed later on. Because the data is stored raw, a Data
Lake does not require a traditional database architecture to be implemented up front.
This blog post will discuss the best practices for building a data lake. So,
without further ado, let’s get started.
BEST PRACTICES TO BUILD A DATA LAKE
1. REGULATION OF DATA INGESTION
Data ingestion is “the flow of data from its origin to data stores such as data
lakes, databases and search engines”. As we add new data into the data lake,
it is important to preserve the data in its native form. Doing so allows later
analysis and predictions to be more accurate. This includes preserving even the
null values in the data, from which skilled data scientists can extract
analytical value when needed.
WHEN SHOULD WE PERFORM DATA AGGREGATION?
Aggregation can be carried out when PII (Personally Identifiable Information)
is present in the data source.
The PII can be replaced with a unique ID before the sources are saved to the
data lake. This balances the protection of user privacy against the
availability of data for analytical purposes. It also supports compliance with
data regulations such as GDPR, CCPA, and HIPAA.
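The replacement of PII with a unique ID could be sketched as follows. This is only a minimal illustration, assuming a keyed hash (HMAC-SHA256) as the ID scheme. The field names, the `pseudonymize` helper, and the secret key are hypothetical and are not part of the original post.

```python
import hashlib
import hmac

# Hypothetical secret; in practice it would come from a secrets manager.
SECRET_KEY = b"replace-with-a-managed-secret"

# Hypothetical set of fields treated as PII in this sketch.
PII_FIELDS = {"email", "phone"}


def pseudonymize(record: dict, key: bytes = SECRET_KEY) -> dict:
    """Return a copy of record with PII fields replaced by a stable ID."""
    out = {}
    for field, value in record.items():
        if field in PII_FIELDS and value is not None:
            digest = hmac.new(key, str(value).encode("utf-8"), hashlib.sha256)
            out[field] = digest.hexdigest()
        else:
            # Non-PII values, including nulls, are preserved as-is.
            out[field] = value
    return out


raw = {"email": "a@example.com", "phone": None, "country": "IN"}
safe = pseudonymize(raw)
```

Because the same input always maps to the same ID under a given key, records can still be joined on the replaced field. Note that pseudonymized data is generally still treated as personal data under GDPR, so this step supports compliance rather than guaranteeing it.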
2. DESIGNING THE RIGHT DATA TRANSFORMATION STRATEGY
The main purpose of collecting data in a Data Lake is to perform operations like
inspection, exploration, and analysis. If the data is not transformed and
cataloged correctly, the workload on the analytical engines increases: they
must scan the entire data set across multiple files, which often results in
query overhead.
MEASURES TO HELP IN DESIGNING THE RIGHT DATA
TRANSFORMATION STRATEGY:
● Store the data in a columnar format such as Apache Parquet or ORC. These
formats offer optimized reads and are open source, which makes the data
available to a wide range of analytical services.
● Partitioning the data by timestamp can have a great impact on search
performance.
● Small files can be compacted into bigger ones asynchronously. This helps
reduce network overhead.
● Using Z-order indexed materialized views helps to serve queries that
filter on data stored in multiple columns.
● Collect data set statistics such as file size, row counts, and histograms
of values.
● Collect column and table statistics to estimate predicate selectivity and
the cost of plans. These statistics also help to perform certain advanced
rewrites in the Data Lake.
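Two of the measures above, timestamp partitioning and small-file compaction, can be sketched in a few lines. This is a rough illustration under stated assumptions: the Hive-style `year=/month=/day=` layout, the greedy grouping rule, and the helper names `partition_path` and `plan_compaction` are all hypothetical choices, not something the original post prescribes.

```python
from datetime import datetime, timezone


def partition_path(event_time: datetime) -> str:
    """Build a Hive-style partition path (year/month/day) from a timestamp.

    Assumes event_time is timezone-aware; it is normalized to UTC first.
    """
    t = event_time.astimezone(timezone.utc)
    return f"year={t.year:04d}/month={t.month:02d}/day={t.day:02d}"


def plan_compaction(file_sizes: dict, target_bytes: int) -> list:
    """Greedily group small files into batches of at most target_bytes.

    A single file larger than target_bytes ends up in a batch of its own.
    """
    batches, current, current_size = [], [], 0
    for name, size in sorted(file_sizes.items()):
        # Close the current batch if adding this file would exceed the target.
        if current and current_size + size > target_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        batches.append(current)
    return batches


path = partition_path(datetime(2024, 3, 5, 12, 0, tzinfo=timezone.utc))
batches = plan_compaction({"a": 40, "b": 50, "c": 30, "d": 90}, target_bytes=100)
```

Here `path` is `year=2024/month=03/day=05`, and the four files are grouped into three batches: `a` and `b` together, then `c`, then `d`. A real compaction job would go on to rewrite each batch as one larger file, ideally in the background so that ingestion is not blocked.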
3. PRIORITISING SECURITY IN A DATA LAKE
The RSA Data Privacy and Security survey conducted in 2019 revealed that
64% of its US respondents and 72% of its UK respondents blamed the
company and not the hacker for the loss of personal data. In a data lake, such
losses are often due to the lack of fine-grained access control mechanisms. As
data, tools, and users increase, so do the risks of security breaches. Hence,
curating a security strategy even before building a data lake is important.
This lets the organization take advantage of the increased agility that comes
with the use of a data lake.
The data lake security protocols must account for compliance with major
security policies.
POINTS TO REMEMBER WHILE CURATING AN EFFICIENT SECURITY
STRATEGY:
● Authentication and authorization of the users who access the data lake
must be enforced. For instance, person A might have access to edit the
data lake whereas person B might have permission only to view it. Users
must be authenticated using usernames, passwords, multi-factor (for
example, multi-device) authentication, etc. Integrating a strong ID
management tool from the underlying cloud solutions provider helps in
achieving this.
● The data should be encrypted at all levels, i.e., both in transit and at
rest, so that only the intended users can read and use it.
● Administrative access should be granted only to skilled and experienced
administrators, thus minimizing the risk of breaches.
● The data lake platform must be hardened so that its functions are isolated
from the other existing cloud services.
● Host security methods such as host intrusion detection, file integrity
monitoring, and log management should be strengthened.
● Redundant copies of critical data must be stored as a backup in
another data lake so that they come in handy in cases of data corruption or
accidental deletion.
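The person A / person B example above amounts to role-based authorization, which can be sketched as below. This is a simplified illustration only; the role names, the user assignments, and the `is_allowed` helper are hypothetical, and a real deployment would rely on the identity management service of the cloud provider rather than in-memory dictionaries.

```python
# Hypothetical roles and the actions each one permits.
ROLE_PERMISSIONS = {
    "editor": {"read", "write"},
    "viewer": {"read"},
}

# Hypothetical user-to-role assignments (person A edits, person B views).
USER_ROLES = {"person_a": "editor", "person_b": "viewer"}


def is_allowed(user: str, action: str) -> bool:
    """Return True only if the user's role grants the requested action."""
    role = USER_ROLES.get(user)
    # Unknown users or roles get an empty permission set (deny by default).
    return action in ROLE_PERMISSIONS.get(role, set())
```

With these assignments, `is_allowed("person_a", "write")` is true while `is_allowed("person_b", "write")` is false, and any user without a role is denied every action.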
4. IMPLEMENTING WELL-FORMULATED DATA GOVERNANCE
STRATEGIES
A good data governance strategy ensures data quality and consistency.
It prevents the data lake from becoming an unmanageable data swamp.
KEY POINTS TO REMEMBER WHILE CRAFTING A GOVERNANCE
STRATEGY FOR A DATA LAKE:
● Data should be identified and cataloged, and sensitive data must be
clearly labeled. This helps users achieve better search results.
● Metadata acts as a tagging system that organizes the data and helps
people find different types of data without confusion.
● No data should be stored beyond the time specified in the compliance
protocols. Storing it longer incurs extra cost and violates those protocols,
so defining proper retention policies for the data is necessary.
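A retention policy applied to cataloged metadata might look like the sketch below. It is illustrative only: the classifications, the retention periods, the catalog entries, and the `expired` helper are all hypothetical, and actual retention periods must come from the applicable compliance protocols.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical retention periods per data classification.
RETENTION = {"clickstream": timedelta(days=90), "billing": timedelta(days=365)}


def expired(catalog: list, now: datetime) -> list:
    """Return names of catalog entries older than their retention period."""
    out = []
    for entry in catalog:
        limit = RETENTION.get(entry["classification"])
        # Entries without a defined policy are left alone in this sketch.
        if limit is not None and now - entry["created"] > limit:
            out.append(entry["name"])
    return out


now = datetime(2024, 6, 1, tzinfo=timezone.utc)
catalog = [
    {"name": "clicks_2024_01", "classification": "clickstream",
     "sensitive": False, "created": datetime(2024, 1, 1, tzinfo=timezone.utc)},
    {"name": "invoices_2024_01", "classification": "billing",
     "sensitive": True, "created": datetime(2024, 1, 1, tzinfo=timezone.utc)},
]
to_delete = expired(catalog, now)
```

In this example only `clicks_2024_01` is past its 90-day limit. The `sensitive` flag plays the role of the labeling described in the first point, so the same catalog can drive both search and retention.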

Best Practices To Build a Data Lake

  • 1. Best Practices to Build a Data Lake https://fibonalabs.com/
  • 3. The need for big data is inevitable. Data is the new currency: it is estimated that 90% of the data in the world today was created in the last two years alone, with 2.5 quintillion bytes of data created every day. With this much data being created, companies face greater challenges in ensuring that they use their data in the best way possible, and creating a data lake is one way to do so. A data lake is a vast pool of raw data that comprises both structured and unstructured data, which can be processed and analyzed later on. Data lakes can eliminate the need to implement traditional database architectures. This blog post discusses the best practices for building a data lake. So, without further ado, let’s get started.
  • 4. BEST PRACTICES TO BUILD A DATA LAKE 1. REGULATION OF DATA INGESTION Data ingestion is “the flow of data from its origin to data stores such as data lakes, databases and search engines”. As we add new data to the data lake, it is important to preserve it in its native form. Doing so lets us produce analyses and predictions with greater accuracy. This includes preserving even the null values in the data, from which proficient data scientists can extract analytical value when needed. WHEN SHOULD WE PERFORM DATA AGGREGATION? Aggregation can be carried out when PII (Personally Identifiable Information) is present in the data source.
  • 5. The PII can be replaced with a unique ID before the sources are saved to the data lake. This bridges the gap between protecting user privacy and keeping data available for analytical purposes. It also helps ensure compliance with data regulations such as GDPR, CCPA, and HIPAA. 2. DESIGNING THE RIGHT DATA TRANSFORMATION IDEA The main purpose of collecting data in a data lake is to perform operations like inspection, exploration, and analysis. If the data is not transformed and cataloged correctly, the workload on the analytical engines increases: the engines must scan the entire data set across multiple files, which often results in query overhead.
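The PII replacement described above can be sketched in a few lines of Python. This is only a minimal illustration, not the method used by the original post: the keyed-hash approach, the field names in `PII_FIELDS`, and the `SECRET_KEY` value are all assumptions made for the example. Null values are deliberately left untouched, in line with the advice to preserve them.

```python
import hashlib
import hmac

# Hypothetical secret; in practice it would come from a secrets manager.
SECRET_KEY = b"replace-with-a-managed-secret"

# Hypothetical PII column names, chosen only for this example.
PII_FIELDS = ("email", "phone")


def pseudonymize(value: str, key: bytes = SECRET_KEY) -> str:
    """Return a stable ID for a PII value using keyed HMAC-SHA256."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def scrub_record(record: dict) -> dict:
    """Copy a record, replacing PII fields with IDs and keeping nulls as nulls."""
    cleaned = dict(record)
    for field in PII_FIELDS:
        if cleaned.get(field) is not None:
            cleaned[field] = pseudonymize(str(cleaned[field]))
    return cleaned
```

Because the same input always maps to the same ID, records belonging to one user can still be joined for analysis after the raw value has been removed.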
  • 6. MEASURES TO HELP IN DESIGNING THE RIGHT DATA TRANSFORMATION STRATEGY: ● Store the data in a columnar format such as Apache Parquet or ORC. These formats offer optimized reads and are open source, which makes the data available to a variety of analytical services. ● Partitioning the data by timestamp can have a great impact on search performance. ● Small files can be combined into bigger ones asynchronously. This helps reduce network overhead. ● Z-order indexed materialized views help serve queries that involve data stored in multiple columns. ● Collect data set statistics like file size, rows, histogram of values to
  • 7. ● Collect column and table statistics to estimate predicate selectivity and the cost of query plans. These statistics also make certain advanced query rewrites possible in the data lake. 3. PRIORITISING SECURITY IN A DATA LAKE The RSA Data Privacy and Security survey conducted in 2019 revealed that 64% of its US respondents and 72% of its UK respondents blamed the company, not the hacker, for the loss of personal data. Such losses are often due to the lack of fine-grained access control mechanisms in the data lake. As data, tools, and users increase, so do the risks of security breaches. Hence, curating a security strategy even before building a data lake is important. This makes it possible to benefit safely from the increased agility that comes with using a data lake.
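Two of the transformation measures above, partitioning by timestamp and combining small files, can be sketched with the standard library alone. This is an illustrative sketch under stated assumptions: the `key=value` directory layout is a common convention rather than something the original post prescribes, and the function names, the `prefix` default, and the greedy grouping strategy are hypothetical.

```python
from datetime import datetime, timezone


def partition_path(event_time: datetime, prefix: str = "events") -> str:
    """Build a date-based partition path from a timezone-aware timestamp."""
    t = event_time.astimezone(timezone.utc)
    return f"{prefix}/year={t.year:04d}/month={t.month:02d}/day={t.day:02d}"


def plan_compaction(file_sizes: dict, target_bytes: int) -> list:
    """Greedily group files (name -> size) into batches of about target_bytes."""
    batches, current, current_size = [], [], 0
    for name, size in sorted(file_sizes.items()):
        # Close the current batch when adding this file would exceed the target.
        if current and current_size + size > target_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        batches.append(current)
    return batches
```

A query that filters on a date range then only needs to read the matching `year=/month=/day=` directories, and each batch returned by `plan_compaction` can be rewritten as one larger file by a background job.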
  • 8. The data lake security protocols must account for compliance with major security policies. POINTS TO REMEMBER WHILE CURATING AN EFFICIENT SECURITY STRATEGY: ● Authentication and authorization of the users who access the data lake must be enforced. For instance, person A might have permission to edit the data lake, whereas person B might have permission only to view it. Users must be authenticated using usernames, passwords, multiple device authentication, etc. Integrating a strong ID management tool from the underlying cloud solutions provider helps achieve this. ● The data should be encrypted at all levels, i.e., both in transit and at rest, so that only the intended users can read and use it.
  • 9. ● Access should be granted only to skilled and experienced administrators, thus minimizing the risk of breaches. ● The data lake platform must be hardened so that its functions are isolated from other existing cloud services. ● Host security measures such as host intrusion detection, file integrity monitoring, and log management should be strengthened. ● Redundant copies of critical data must be stored as a backup in another data lake so that they come in handy in cases of data corruption or accidental deletion. 4. IMPLEMENTING WELL-FORMULATED DATA GOVERNANCE STRATEGIES A good data governance strategy ensures data quality and consistency.
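The person A / person B example from the authorization point can be expressed as a small role-based permission check. This is a minimal sketch only: the role names, the user names, and the in-memory dictionaries are assumptions for illustration, and a real data lake would delegate this to the identity management service of its cloud provider.

```python
from enum import Flag, auto


class Permission(Flag):
    VIEW = auto()
    EDIT = auto()


# Hypothetical roles and assignments, mirroring the person A / person B example.
ROLES = {
    "editor": Permission.VIEW | Permission.EDIT,
    "viewer": Permission.VIEW,
}
USER_ROLES = {"person_a": "editor", "person_b": "viewer"}


def is_allowed(user: str, needed: Permission) -> bool:
    """Return True only if the user's role grants the needed permission."""
    role = USER_ROLES.get(user)
    if role is None:
        return False  # Unknown users are denied by default.
    return needed in ROLES[role]
```

Denying unknown users by default keeps the check fail-closed, which matches the goal of minimizing the risk of breaches.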
  • 10. It prevents the data lake from becoming an unmanageable data swamp. KEY POINTS TO REMEMBER WHILE CRAFTING A GOVERNANCE STRATEGY FOR A DATA LAKE: ● Data should be identified and cataloged, and sensitive data must be clearly labeled. This helps users get better search results. ● Metadata acts as a tagging system that organizes data and helps people find different types of data without confusion. ● No data should be stored beyond the time specified in the compliance protocols; doing so leads to extra cost as well as compliance violations. So, defining proper retention policies for the data is necessary.
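A retention policy of the kind described above can be checked against catalog metadata. The sketch below is hypothetical throughout: the categories, the retention periods, and the shape of the catalog entries (`path`, `category`, `created`) are assumptions for illustration, and real periods must come from the applicable compliance protocols.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical retention periods per data category.
RETENTION = {
    "clickstream": timedelta(days=90),
    "billing": timedelta(days=365 * 7),
}


def expired_objects(catalog: list, now: datetime) -> list:
    """Return paths of catalog entries older than their category's retention."""
    expired = []
    for entry in catalog:
        limit = RETENTION.get(entry["category"])
        # Entries without a defined policy are skipped rather than deleted.
        if limit is not None and now - entry["created"] > limit:
            expired.append(entry["path"])
    return expired
```

Running such a check on a schedule, using the metadata that the catalog already holds, gives a list of objects to review or delete before they become a cost or compliance problem.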