Best Practices to Build a Data Lake
https://fibonalabs.com/
The need for big data is inevitable. Data is the new currency: it is estimated
that 90% of the data in the world today has been created in the last two years
alone, with 2.5 quintillion bytes of data created every day. With this much
data being created, companies face growing challenges in making the best
possible use of it. Creating a Data Lake is one way to address them.
A Data Lake is a vast pool of raw data that comprises structured and
unstructured data. This data can be processed and analyzed later on. Data
Lakes remove the need to model data into a traditional database schema before
storing it. This blog post discusses the best practices for building a data
lake. So, without further ado, let’s get started.
BEST PRACTICES TO BUILD A DATA LAKE
1. REGULATION OF DATA INGESTION
Data ingestion is “the flow of data from its origin to data stores such as data
lakes, databases and search engines”. As we add new data into the data lake,
it is important to preserve the data in its native form. By doing so, we can
generate analyses and predictions with greater accuracy. This includes
preserving even the null values in the data, from which proficient data
scientists can extract analytical value when needed.
WHEN SHOULD WE PERFORM DATA AGGREGATION?
Aggregation can be carried out when there is PII (Personally Identifiable
Information) present in the data source.
The PII can be replaced with a unique ID before the sources are saved to the
data lake. This balances protecting user privacy against keeping the data
available for analytical purposes. It also supports compliance with data
regulations such as GDPR, CCPA, and HIPAA.
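As an illustration of replacing PII with a unique ID before data reaches the lake, here is a minimal Python sketch. The field names, the `user_id` key, and the secret key are hypothetical, and a keyed hash (HMAC-SHA256) is only one possible way to derive a stable ID; the original post does not prescribe a method. Non-PII fields, including null values, are passed through untouched.

```python
import hashlib
import hmac

# Hypothetical secret; in practice it would come from a secrets manager.
SECRET_KEY = b"replace-with-a-managed-secret"

# Hypothetical PII field names used only for this example.
PII_FIELDS = ("email", "name", "phone")


def pseudonymize(record, key=SECRET_KEY):
    """Return a copy of record with PII fields removed and a stable unique ID added."""
    # Derive the ID from the PII values so the same person maps to the same ID.
    identity = "|".join(str(record.get(field, "")) for field in PII_FIELDS)
    unique_id = hmac.new(key, identity.encode("utf-8"), hashlib.sha256).hexdigest()
    # Keep every non-PII field exactly as it arrived, nulls included.
    cleaned = {k: v for k, v in record.items() if k not in PII_FIELDS}
    cleaned["user_id"] = unique_id
    return cleaned


if __name__ == "__main__":
    raw = {"email": "a@example.com", "name": "A", "phone": None, "amount": None}
    print(pseudonymize(raw))
```

Because the ID is derived with a secret key, two records from the same person can still be joined for analysis, while the original values cannot be read back from the lake itself.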
2. DESIGNING THE RIGHT DATA TRANSFORMATION IDEA
The main purpose of collecting data in a Data Lake is to perform operations
like inspection, exploration, and analysis. If the data is not transformed and
cataloged correctly, the workload on the analytical engines increases: they
have to scan the entire data set across multiple files, which often results in
query overheads.
MEASURES TO HELP IN DESIGNING THE RIGHT DATA
TRANSFORMATION STRATEGY:
● Store the data in a columnar format such as Apache Parquet or ORC.
These formats offer optimized reads and are open-source, which increases
the availability of data for various analytical services.
● Partitioning the data by timestamp can have a great impact on search
performance.
● Small files can be compacted into bigger ones asynchronously. This helps in
reducing network overheads.
● Using Z-order indexed materialized views helps to serve queries that
filter on data stored in multiple columns.
● Collect data set statistics (file sizes, row counts, histograms of values)
along with column and table statistics to estimate predicate selectivity and
the cost of query plans. This also helps to perform certain advanced rewrites
in the Data Lake.
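Two of the measures above, timestamp partitioning and small-file compaction, can be sketched in plain Python. This is an illustrative sketch only: the Hive-style `year=/month=/day=` layout, the `s3://lake/events` root, and the greedy grouping rule are assumptions, not something the original post specifies, and a real lake would normally rely on its query engine or table format for both.

```python
from datetime import datetime, timezone


def partition_path(root, event_time):
    """Build a year/month/day partition directory for a timezone-aware timestamp."""
    t = event_time.astimezone(timezone.utc)
    return f"{root}/year={t.year:04d}/month={t.month:02d}/day={t.day:02d}"


def plan_compaction(file_sizes, target_bytes):
    """Greedily group small files into batches of at most roughly target_bytes."""
    batches, current, current_size = [], [], 0
    for name, size in sorted(file_sizes.items()):
        # Close the current batch when adding this file would exceed the target.
        if current and current_size + size > target_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        batches.append(current)
    return batches


if __name__ == "__main__":
    ts = datetime(2024, 3, 5, 12, 0, tzinfo=timezone.utc)
    print(partition_path("s3://lake/events", ts))
    print(plan_compaction({"a": 40, "b": 40, "c": 40, "d": 10}, 100))
```

With a layout like this, a query restricted to one day only needs to read the files under that day's directory, and each batch from `plan_compaction` can be rewritten as a single larger file in the background.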
3. PRIORITISING SECURITY IN A DATA LAKE
The RSA Data Privacy and Security survey conducted in 2019 revealed that
64% of its US respondents and 72% of its UK respondents blamed the
company and not the hacker for the loss of personal data. A lack of
fine-grained access control mechanisms in the data lake is one way such losses
happen. As data, tools, and users increase, so does the risk of security
breaches. Hence, curating a security strategy even before building a data lake
is important. It lets the organization benefit from the increased agility that
comes with the use of a data lake without taking on these risks.
The data lake security protocols must account for compliance with major
security policies.
POINTS TO REMEMBER WHILE CURATING AN EFFICIENT SECURITY
STRATEGY:
● Authentication and authorization of the users who access the data lake
must be enforced. For instance, person A might have access to edit the
data lake whereas person B might have permission only to view it. They
must be authenticated using usernames, passwords, multiple device
authentication, etc. Integrating a strong identity management tool from
the underlying cloud provider would help in achieving this.
● The data should be encrypted at all levels, i.e., both in transit and at
rest, so that only the intended users can read and use it.
● Access should be granted only to skilled and well-experienced
administrators, thus minimizing the risk of breaches.
● The data lake platform must be hardened so that its functions are isolated
from the other existing cloud services.
● Host security methods such as host intrusion detection, file integrity
monitoring, and log management should be enhanced.
● Redundant copies of critical data must be stored as a backup in
another data lake so that they come in handy in cases of data corruption or
accidental deletion.
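The person A / person B example above corresponds to a simple role-based authorization check, sketched below in Python. The role names, the user names, and the in-memory dictionaries are hypothetical; in a real deployment these mappings would live in the cloud provider's identity management service rather than in application code.

```python
# Hypothetical role definitions: which actions each role may perform.
ROLE_PERMISSIONS = {
    "viewer": {"read"},
    "editor": {"read", "write"},
}

# Hypothetical user-to-role assignments matching the example in the text.
USER_ROLES = {
    "person_a": "editor",
    "person_b": "viewer",
}


def is_allowed(user, action):
    """Return True only if the user's role explicitly grants the action."""
    role = USER_ROLES.get(user)
    # Unknown users and unknown roles fall through to an empty permission set.
    return action in ROLE_PERMISSIONS.get(role, set())


if __name__ == "__main__":
    print(is_allowed("person_a", "write"))
    print(is_allowed("person_b", "write"))
```

The check denies by default: a user without a role, or an action that no role lists, is refused rather than allowed.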
4. IMPLEMENTING WELL-FORMULATED DATA GOVERNANCE
STRATEGIES
A good data governance strategy ensures data quality and consistency.
It prevents the data lake from becoming an unmanageable data swamp.
KEY POINTS TO REMEMBER WHILE CRAFTING A GOVERNANCE
STRATEGY FOR A DATA LAKE:
● Data should be identified and cataloged. The sensitive data must be
clearly labeled. This would help the users achieve better search results.
● Creating metadata acts as a tagging system to organize data and assist
people during their search for different types of data without confusion.
● No data should be stored beyond the time specified in the compliance
protocols. Keeping it longer leads to extra storage cost as well as
compliance violations. So, defining proper retention policies for the data is
necessary.
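A retention policy of the kind described above can be checked against a metadata catalog with a few lines of Python. This is a sketch under stated assumptions: the catalog entry fields (`path`, `dataset`, `created`, `sensitive`), the dataset names, and the retention periods are all hypothetical, and a real policy would come from the applicable compliance rules.

```python
from datetime import date, timedelta

# Hypothetical retention periods, in days, per dataset.
RETENTION_DAYS = {
    "clickstream": 90,
    "billing": 365 * 7,
}


def expired_objects(catalog, today):
    """Return the paths of catalog entries older than their dataset's retention period."""
    expired = []
    for entry in catalog:
        limit = RETENTION_DAYS.get(entry["dataset"])
        # Datasets without a defined policy are skipped here, not deleted.
        if limit is not None and today - entry["created"] > timedelta(days=limit):
            expired.append(entry["path"])
    return expired


if __name__ == "__main__":
    catalog = [
        {"path": "raw/clicks/old", "dataset": "clickstream",
         "created": date(2024, 1, 1), "sensitive": False},
        {"path": "raw/billing/old", "dataset": "billing",
         "created": date(2024, 1, 1), "sensitive": True},
    ]
    print(expired_objects(catalog, date(2024, 6, 1)))
```

The same catalog entries carry the sensitivity label and dataset tag, so they also serve the cataloging and metadata points above.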
THANK YOU

Best Practices To Build a Data Lake

  • 1. Best Practices to Build a Data Lake https://fibonalabs.com/
  • 2.
  • 3. The need for big data is inevitable. Data is the new currency, and it is estimated that 90% of the data in the world today has been created in the last two years alone, with 2.5 quintillion bytes of data created every day. With this amount of data being created, companies are facing greater challenges to ensure that they are using their data in the best way possible, out of which creating a Data Lake is one such method. A Data Lake is a vast pool of raw data that comprises structured and unstructured data. This data can be processed and analyzed later on. Data Lakes eliminates the need for implementing traditional database architectures. This blog post will discuss the best practices for building a data lake. So, without further ado, let’s get started.
  • 4. BEST PRACTICES TO BUILD A DATA LAKE 1. REGULATION OF DATA INGESTION Data ingestion is “the flow of data from its origin to data stores such as data lakes, databases and search engines”. As we add new data to the data lake, it is important to preserve the data in its native form. Doing so lets us produce analyses and predictions with greater accuracy. This includes preserving even the null values in the data, from which proficient data scientists can extract analytical value when needed. WHEN SHOULD WE PERFORM DATA AGGREGATION? Aggregation can be carried out when PII (Personally Identifiable Information) is present in the data source.
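One common way to handle PII at ingestion is to swap each value for a keyed, stable identifier before the record is written. The following is a minimal sketch under stated assumptions, not the deck's own method: the field names in `PII_FIELDS`, the `SECRET_KEY` value, and the helper names are all hypothetical, and HMAC-SHA256 is just one reasonable choice of ID function. Non-PII fields, including nulls, pass through unchanged, in line with the advice to keep data in its native form.

```python
import hashlib
import hmac

# Hypothetical secret; in practice it would come from a secrets manager,
# never from source code.
SECRET_KEY = b"replace-with-a-managed-secret"

# Hypothetical set of field names treated as PII in this example.
PII_FIELDS = {"email", "phone"}


def pseudonymize(value: str, key: bytes = SECRET_KEY) -> str:
    """Return a stable keyed ID for a PII value using HMAC-SHA256."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def scrub_record(record: dict) -> dict:
    """Replace PII fields with IDs; keep every other field, nulls included."""
    cleaned = {}
    for field, value in record.items():
        if field in PII_FIELDS and value is not None:
            cleaned[field] = pseudonymize(str(value))
        else:
            cleaned[field] = value
    return cleaned
```

Because the same input always maps to the same ID under a given key, records for one user can still be joined for analysis. Note that keyed IDs of this kind are pseudonymous rather than anonymous, so the key itself needs the same protection as the original data.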
  • 5. The PII can be replaced with a unique ID before the sources are saved to the data lake. This bridges the gap between protecting user privacy and keeping data available for analytical purposes. It also supports compliance with data regulations such as GDPR, CCPA, and HIPAA. 2. DESIGNING THE RIGHT DATA TRANSFORMATION IDEA The main purpose of collecting data in a data lake is to perform operations like inspection, exploration, and analysis. If the data is not transformed and cataloged correctly, the workload on the analytical engines increases: they must scan the entire data set across multiple files, which often results in query overhead.
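One way to avoid the full scans described above is to lay files out by date, so that a query over a time range only has to touch the matching directories. The sketch below illustrates the idea with Hive-style `key=value` paths; the `events` root and both function names are hypothetical, and a real engine would do this pruning internally rather than through user code.

```python
from datetime import date, datetime, timedelta, timezone


def partition_path(event_time: datetime, root: str = "events") -> str:
    """Build a date partition path for a timezone-aware event timestamp."""
    t = event_time.astimezone(timezone.utc)  # normalize to UTC before bucketing
    return f"{root}/year={t.year:04d}/month={t.month:02d}/day={t.day:02d}"


def partitions_for_range(start: date, end: date, root: str = "events") -> list[str]:
    """List the partition paths a query over [start, end] needs to read."""
    n_days = (end - start).days
    return [
        f"{root}/year={d.year:04d}/month={d.month:02d}/day={d.day:02d}"
        for d in (start + timedelta(days=i) for i in range(n_days + 1))
    ]
```

A query for three days of data then reads three directories instead of the whole data set, which is where the gain in search performance comes from.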
  • 6. MEASURES TO HELP IN DESIGNING THE RIGHT DATA TRANSFORMATION STRATEGY: ● Store the data in a columnar format such as Apache Parquet or ORC. These formats offer optimized reads and are open source, which makes the data available to a variety of analytical services. ● Partitioning the data by timestamp can have a great impact on search performance. ● Small files can be compacted into bigger ones asynchronously. This helps reduce network overhead. ● Using Z-order indexed materialized views helps serve queries that filter on data stored in multiple columns. ● Collect data set statistics like file size, rows, histogram of values to
  • 7. ● Collect column and table statistics to estimate predicate selectivity and the cost of plans. These statistics also help to perform certain advanced rewrites in the data lake. 3. PRIORITISING SECURITY IN A DATA LAKE The RSA Data Privacy and Security survey conducted in 2019 revealed that 64% of its US respondents and 72% of its UK respondents blamed the company and not the hacker for the loss of personal data. Such losses are often due to a lack of fine-grained access control mechanisms in the data lake. As data, tools, and users increase, so do the risks of security breaches. Hence, curating a security strategy even before building a data lake is important. Doing so lets an organization benefit from the increased agility that comes with the use of a data lake.
  • 8. The data lake security protocols must account for compliance with major security policies. POINTS TO REMEMBER WHILE CURATING AN EFFICIENT SECURITY STRATEGY: ● Authentication and authorization of the users who access the data lake must be enforced. For instance, person A might have permission to edit the data lake, whereas person B might have permission only to view it. Users must be authenticated using usernames, passwords, multiple device authentication, etc. Integrating a strong identity management tool from the underlying cloud provider helps achieve this. ● The data should be encrypted at all levels, i.e., both in transit and at rest, so that only the intended users can understand and use it.
  • 9. ● Access should be granted only to skilled and well-experienced administrators, thus minimizing the risk of breaches. ● The data lake platform must be hardened so that its functions are isolated from other existing cloud services. ● Host security methods such as host intrusion detection, file integrity monitoring, and log management should be strengthened. ● Redundant copies of critical data must be stored as a backup in another data lake so that they come in handy in cases of data corruption or accidental deletion. 4. IMPLEMENTING WELL-FORMULATED DATA GOVERNANCE STRATEGIES A good data governance strategy ensures data quality and consistency.
  • 10. It prevents the data lake from becoming an unmanageable data swamp. KEY POINTS TO REMEMBER WHILE CRAFTING A GOVERNANCE STRATEGY FOR A DATA LAKE: ● Data should be identified and cataloged, and sensitive data must be clearly labeled. This helps users achieve better search results. ● Metadata acts as a tagging system that organizes data and helps people find different types of data without confusion. ● No data should be stored beyond the time specified in the compliance protocols. Doing so results in unnecessary cost as well as compliance violations. So, defining proper retention policies for the data is necessary.
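The retention point above can be turned into a simple periodic check. This is a minimal sketch, assuming each stored object carries a category tag and a creation time in its catalog metadata; the `RETENTION` periods, the `StoredObject` type, and the `expired` helper are hypothetical and would be replaced by whatever the applicable compliance protocol and catalog actually define.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Hypothetical retention periods per data category.
RETENTION = {
    "clickstream": timedelta(days=90),
    "billing": timedelta(days=365 * 7),
}


@dataclass
class StoredObject:
    path: str
    category: str
    created_at: datetime  # timezone-aware creation time from the catalog


def expired(objects, now=None):
    """Return the objects that have outlived their category's retention period."""
    now = now or datetime.now(timezone.utc)
    overdue = []
    for obj in objects:
        limit = RETENTION.get(obj.category)
        # Objects with no known policy are skipped rather than flagged.
        if limit is not None and now - obj.created_at > limit:
            overdue.append(obj)
    return overdue
```

Skipping uncategorized objects is a deliberate design choice in this sketch: it avoids deleting data by accident, and it also shows why cataloging and labeling come first, since unlabeled data cannot be matched to a policy.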