Contents
WHY TEST DATA LAKES
Testing Approach & Tools
  ELT Testing
    Data Staging Validation
    MapReduce Validation
    Output Validation
  Architectural Testing
  Security Testing
  Visualization Testing
References
WHY TEST DATA LAKES
Data scientists need access to raw data from multiple sources for effective analytical discovery and ideation.
Data lakes (repositories that hold data in several forms) provide a platform that preserves the fidelity of the
original data and the lineage of data transformations.
Data lakes are built on the principles of ELT, in which data is first extracted, then loaded, and only afterwards
transformed. Testing a data lake is a complex task that requires integrating numerous technologies for data storage,
ingestion, processing, etc. With no established standards for security, governance, and collaboration, things get even
more complicated. Testing data sets consists largely of verifying the data processing checks and data
characteristics such as conformity, accuracy, duplication, consistency, validity, and completeness, using
different tools, approaches, and frameworks.
Compared with a traditional data warehouse, the scope of data lake testing differs in several respects: data,
infrastructure, and validation strategy. The main challenges are:
• Heterogeneous and unstructured data spread across different layers
• The continuous explosion of data and information, resulting in bad data
• Difficult business processes due to complicated business logic
• Ineffective decision making due to bad or poor data
• The increased cost of handling the variety, volume, and velocity of large data sets
• The wider scope of datasets and sources, which needs broader data governance and support
• Performance issues due to heightened data volumes
[Figure: verification required at different stages of data lake testing]
Testing Approach & Tools
The following approaches can be used to test data lakes:
1) ELT Testing
Transferring raw data into HDFS requires validating job executions across different environments, together with
row count and duplicate checks, data type and value checks, key file setup checks, partitioning checks, and delta
and full load checks. The stages to be considered as part of this migration testing are:
a) Data Staging Validation
The first stage of testing is carried out when data is extracted from sources such as social media, weblogs, and
RDBMSs and uploaded to HDFS.
Activities in this stage include:
• Validating data from sources such as databases, web servers, emails, IoT devices, and FTP to make sure that
the correct data is pulled into the system
• Comparing the source data with the data loaded into the Hadoop system to confirm that the two match
• Verifying the extracted data across each ingestion method in scope (one-time, batch, and real-time load)
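The source-to-target comparison described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: rows are modelled as in-memory tuples, and the helper names `row_fingerprint` and `compare_datasets` are hypothetical. A real check would read from the source system and from HDFS instead.

```python
import hashlib
from collections import Counter

def row_fingerprint(row):
    """Hash one row (a tuple of values) into a stable hex digest."""
    # repr() keeps None, "" and 0 distinguishable from one another.
    joined = "\x1f".join(repr(value) for value in row)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()

def compare_datasets(source_rows, loaded_rows):
    """Compare row counts, missing rows, unexpected rows, and duplicates."""
    src = Counter(row_fingerprint(r) for r in source_rows)
    tgt = Counter(row_fingerprint(r) for r in loaded_rows)
    return {
        "source_count": sum(src.values()),
        "loaded_count": sum(tgt.values()),
        # Counter subtraction keeps only positive differences.
        "missing_in_target": sum((src - tgt).values()),
        "unexpected_in_target": sum((tgt - src).values()),
        "duplicates_in_target": sum(c - 1 for c in tgt.values() if c > 1),
    }

# Illustrative data only: the third source row was lost and the second loaded twice.
source = [(1, "alice", 10.5), (2, "bob", None), (3, "carol", 7.0)]
loaded = [(1, "alice", 10.5), (2, "bob", None), (2, "bob", None)]
report = compare_datasets(source, loaded)
print(report)
```

Note that the row counts match here (3 and 3) even though the load is wrong, which is why a count check alone is not sufficient and is paired with a content comparison.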
b) MapReduce Validation
The second stage is validation of the MapReduce processing. In this stage, the tester verifies the business logic on
each individual node and then validates it again after running across multiple nodes, ensuring that:
• The MapReduce process works correctly
• Data aggregation or segregation rules are implemented on the data
• Key-value pairs are generated
• The data is valid after the MapReduce process
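One way to picture the single-node versus multi-node check is a small in-process reference model, sketched below in Python. It does not use Hadoop. The mapper, the reducer, and the sales-style records are hypothetical stand-ins for the business logic under test. The point is only that aggregating partitions separately and merging them should give the same key-value pairs as a single run.

```python
from collections import defaultdict

def map_phase(record):
    """Hypothetical mapper: emit one (region, amount) pair per record."""
    region, amount = record
    yield region, amount

def reduce_phase(key, values):
    """Hypothetical reducer: sum all amounts for one key."""
    return key, sum(values)

def run_map_reduce(records):
    """Run map, shuffle (group by key), and reduce in a single process."""
    grouped = defaultdict(list)
    for record in records:
        for key, value in map_phase(record):
            grouped[key].append(value)
    return dict(reduce_phase(k, v) for k, v in grouped.items())

def merge_partials(partials):
    """Combine per-partition results, as a multi-node run would."""
    merged = defaultdict(int)
    for partial in partials:
        for key, value in partial.items():
            merged[key] += value
    return dict(merged)

records = [("east", 10), ("west", 5), ("east", 7), ("north", 1), ("west", 3)]
single_node = run_map_reduce(records)

# Simulate three nodes by splitting the input into three partitions.
partitions = [records[:2], records[2:4], records[4:]]
multi_node = merge_partials(run_map_reduce(p) for p in partitions)

assert single_node == multi_node, "aggregation differs across partitions"
print(single_node)
```

The merge step is valid here because summation is associative. A rule such as an average would need the reducer to carry sums and counts instead.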
c) Output Validation
The third and final stage of testing is output validation. At this point the output data files have been generated and
are ready to be moved to an EDW (Enterprise Data Warehouse) or another downstream system, depending on the analysis
or analytics required.
Activities in the third stage include:
• Checking that the transformation rules are correctly applied
• Checking data integrity and that the data loads successfully into the target system
• Checking that there is no data corruption by comparing the target data with the HDFS data
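A minimal Python sketch of the output checks above follows. It assumes that the transformation rule can be restated independently by the tester and that both data sets fit in memory. The rule in `expected_transform`, the field names, and the sample rows are all hypothetical.

```python
def expected_transform(row):
    """Hypothetical rule: trim and upper-case the country, convert cents to units."""
    return {
        "id": row["id"],
        "country": row["country"].strip().upper(),
        "amount": row["amount_cents"] / 100,
    }

def validate_output(staged_rows, target_rows, key="id"):
    """Re-apply the rule to staged rows and diff the result against the target."""
    target_by_key = {row[key]: row for row in target_rows}
    issues = []
    for row in staged_rows:
        expected = expected_transform(row)
        actual = target_by_key.get(expected[key])
        if actual is None:
            issues.append((expected[key], "missing in target"))
            continue
        for field, value in expected.items():
            if actual.get(field) != value:
                issues.append(
                    (expected[key],
                     f"{field}: expected {value!r}, got {actual.get(field)!r}")
                )
    return issues

# Illustrative data: row 2 was not upper-cased, row 3 never reached the target.
staged = [
    {"id": 1, "country": " us ", "amount_cents": 1250},
    {"id": 2, "country": "de", "amount_cents": 300},
    {"id": 3, "country": "fr", "amount_cents": 50},
]
target = [
    {"id": 1, "country": "US", "amount": 12.5},
    {"id": 2, "country": "de", "amount": 3.0},
]
issues = validate_output(staged, target)
for issue in issues:
    print(issue)
```

Keeping the expected rule separate from the pipeline code is a deliberate choice in this sketch: if the test reused the pipeline's own transformation, a bug in that logic would go unnoticed.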
2) Architectural Testing
Architectural testing forms a crucial part of data lake testing, as a poor architecture leads to poor performance.
Data lake technologies are extremely resource intensive and process large volumes of data, which makes
architectural testing essential. Because a great deal of data movement is also involved, performance testing plays
an even more important role in measuring and tuning:
• Memory utilization
• Job completion time
• Data throughput
• Data storage: how data is stored across different nodes
• Commit logs: how large the commit log is allowed to grow
• Concurrency: how many threads can perform read and write operations
• Caching: tuning of the "row cache" and "key cache" settings
• Timeouts: values for connection timeout, query timeout, etc.
• JVM parameters: heap size, GC algorithms, etc.
• MapReduce performance: sorts, merges, etc.
• Message queue: message rate, size, etc.
To conduct performance testing, a structured approach is needed, since it involves huge volumes of both structured
and unstructured data. The teams involved need the proficiency to apply the defined approach, as follows:
1. Set up the application cluster that needs to be tested.
2. Identify and design the corresponding workloads.
3. Prepare individual custom scripts to check:
   • sub-component performance
   • how each individual component performs in isolation
4. Execute the tests and analyze the results, including:
   • the rate at which the system consumes data from different data sources
   • the speed at which the MapReduce jobs or queries are executed
5. Reconfigure and retest components that did not perform optimally.
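Steps 3 and 4 can be illustrated with a small timing harness, sketched here in Python. It measures completion time and throughput for each component in isolation. The `ingest` and `aggregate` functions and the synthetic workload are hypothetical stand-ins. On a real cluster the same idea would wrap job submissions rather than local function calls.

```python
import time

def measure(component, records):
    """Time one component in isolation and derive its throughput."""
    start = time.perf_counter()
    processed = component(records)
    elapsed = time.perf_counter() - start
    return {
        "records": processed,
        "seconds": elapsed,
        "records_per_second": processed / elapsed if elapsed > 0 else float("inf"),
    }

def ingest(records):
    """Hypothetical ingestion step: consume every record and count it."""
    return sum(1 for _ in records)

def aggregate(records):
    """Hypothetical processing step: sum values per key, return records handled."""
    totals = {}
    for key, value in records:
        totals[key] = totals.get(key, 0) + value
    return len(records)

# Synthetic workload (step 2): 100,000 key-value records over 10 keys.
workload = [(i % 10, i) for i in range(100_000)]

results = {
    name: measure(component, workload)
    for name, component in [("ingest", ingest), ("aggregate", aggregate)]
}
for name, result in results.items():
    print(f"{name}: {result['seconds']:.4f}s, "
          f"{result['records_per_second']:.0f} records/s")
```

Absolute numbers from such a harness depend on the machine, so they are most useful when compared between runs before and after a configuration change (step 5).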
3) Security Testing
Since data lakes hold the data of the entire enterprise, security testing is required to verify authentication and
authorization for different roles, as well as encryption of data at rest and in motion.
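Role-based authorization checks lend themselves to a table-driven test, sketched below in Python. Everything named here is an assumption made for illustration: the roles, the "raw" and "curated" zones, and the in-memory `POLICY` that stands in for the real access-control system. In practice the `check` argument would query the platform under test.

```python
# Hypothetical policy: role -> zone -> allowed actions.
POLICY = {
    "analyst": {"curated": {"read"}},
    "engineer": {"raw": {"read", "write"}, "curated": {"read", "write"}},
    "auditor": {"raw": {"read"}, "curated": {"read"}},
}

def is_allowed(role, zone, action):
    """Stand-in for the system under test; unknown roles or zones are denied."""
    return action in POLICY.get(role, {}).get(zone, set())

# Expected access matrix: (role, zone, action, should_be_allowed).
EXPECTED = [
    ("analyst", "curated", "read", True),
    ("analyst", "raw", "read", False),
    ("analyst", "curated", "write", False),
    ("engineer", "raw", "write", True),
    ("auditor", "raw", "read", True),
    ("auditor", "curated", "write", False),
    ("guest", "curated", "read", False),
]

def run_authorization_checks(check, cases):
    """Return every case where the observed decision differs from the expected one."""
    return [
        (role, zone, action)
        for role, zone, action, expected in cases
        if check(role, zone, action) != expected
    ]

failures = run_authorization_checks(is_allowed, EXPECTED)
print("authorization failures:", failures)
```

Including negative cases (access that must be denied) matters as much as the positive ones, since an overly permissive policy would pass a test suite that only checks allowed actions.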
4) Visualization Testing
When a new report or dashboard is developed for consumption by other users, it is important to perform a few
checks to validate the data and design of the included reports. Key aspects of validation include:
• Design Check
• Prompt Check
• Data Accuracy Check
• Drill Down Report Check
• Browser Checks
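The data accuracy and drill-down checks can be partly automated. The Python sketch below assumes that the report totals and drill-down rows can be exported as plain values. The function name `check_report`, the region keys, and the tolerance are hypothetical choices, not part of any reporting tool.

```python
import math
from collections import defaultdict

def check_report(source_rows, report_totals, drill_down, tolerance=0.01):
    """Compare each displayed total with the source data and with its drill-down rows."""
    recomputed = defaultdict(float)
    for region, amount in source_rows:
        recomputed[region] += amount

    problems = []
    for region, shown in report_totals.items():
        from_source = recomputed.get(region, 0.0)
        if not math.isclose(shown, from_source, abs_tol=tolerance):
            problems.append(
                f"{region}: report shows {shown}, source gives {from_source}"
            )
        detail_sum = sum(drill_down.get(region, []))
        if not math.isclose(shown, detail_sum, abs_tol=tolerance):
            problems.append(
                f"{region}: report shows {shown}, drill-down sums to {detail_sum}"
            )
    return problems

# Illustrative data: the "west" total on the dashboard is off by 0.5.
source_rows = [("east", 10.0), ("east", 5.5), ("west", 4.0)]
report_totals = {"east": 15.5, "west": 4.5}
drill_down = {"east": [10.0, 5.5], "west": [4.0]}

problems = check_report(source_rows, report_totals, drill_down)
for problem in problems:
    print(problem)
```

Design, prompt, and browser checks are largely visual and are not covered by this sketch.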
References
https://qaconsultants.com/wp-content/uploads/2015/10/Primer-on-Big-Data-Testing.pdf

Testing insights from data lakes