With data flowing in from different mediums (RDBMS, social media, legacy files), Big Data platforms are an effective means of processing huge volumes of data, and the Data Lake plays a critical role in storing structured, semi-structured, and unstructured data. Here I have tried to give you a glimpse of how testing of a Data Lake is generally performed, and of the different types and approaches we follow.
2. WHY TEST DATA LAKES
With data scientists requiring access to raw data from multiple sources for effective analytical discovery and
ideation, data lakes (repositories holding several forms of data) provide a platform for preserving the original
data fidelity and the lineage of data transformations.
Developed on the principles of ELT, wherein data is first extracted, then loaded, and transformations are performed afterward,
a Data Lake is complex to test, requiring the integration of numerous technologies for data storage,
ingestion, processing, etc. With no existing standard rules for security, governance, and collaboration, things get even
more complicated. Testing of data sets largely involves verification of the data processing checks and of various
data characteristics, like conformity, accuracy, duplication, consistency, validity, and data completeness, using
different tools, approaches, and frameworks.
In comparison to the traditional data warehouse, the scope of data lake testing varies in multiple aspects of data,
infrastructure, and validation strategy:
Heterogeneous and unstructured data spread across different layers
The continuous explosion of data and information resulting in bad data
Business processes made difficult by complicated business logic
Ineffective decision making due to bad or poor data
The increased cost of handling variety, volume, and velocity of large data sets
The wider scope of dataset and source needs larger data governance & support
Performance issues due to heightened data volumes.
Verification is required at different stages of data lake testing, as described in the approaches below.
Testing Approach & Tools
The various approaches that can be followed for testing data lakes include:
1) ELT Testing
Transferring the raw data into the HDFS system will require validation across job executions in different
environments, row-count and duplicate checks, data-type and value checks, key file set-up checks, partitioning, and delta
and full-load checks. The different stages to be considered as part of this migration testing include:
a) Data Staging Validation
When data is extracted from various sources (such as social media, weblogs, and RDBMS) and uploaded to HDFS, an
initial stage of testing is carried out.
Activities in this stage include:
Data from various sources like databases, web servers, emails, IoT devices, FTP, etc. should be validated to make sure that the correct data is pulled into the system
The source data should be compared with the data loaded into the Hadoop system to assure data correctness
Extracted data should be verified across the ingestion methods in scope (one-time, batch, and real-time load).
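As an illustration of these staging checks, the sketch below compares the source row count against the landed file and scans for duplicates. It is a minimal Python sketch, assuming the source is reachable through a SQL connection and the HDFS copy is readable as a local delimited file; the database, table, and path names are hypothetical.

    import csv
    import sqlite3

    # Staging check sketch: compare the source row count with the count of
    # records landed in the lake, and scan the landed file for duplicates.
    # sqlite3 and a local CSV stand in for the real RDBMS and HDFS file;
    # all names below are hypothetical.

    def source_row_count(conn, table):
        # Count rows in the source table.
        return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]

    def landed_row_count(path):
        # Count data rows in the ingested file, excluding the header.
        with open(path, newline="") as f:
            return sum(1 for _ in csv.reader(f)) - 1

    def landed_duplicates(path):
        # Count full-row duplicates in the landed file.
        with open(path, newline="") as f:
            rows = [tuple(r) for r in csv.reader(f)][1:]
        return len(rows) - len(set(rows))

    conn = sqlite3.connect("source.db")               # hypothetical source
    src = source_row_count(conn, "customers")         # hypothetical table
    tgt = landed_row_count("staging/customers.csv")   # hypothetical landed file
    dups = landed_duplicates("staging/customers.csv")
    assert src == tgt, f"row count mismatch: source={src}, landed={tgt}"
    assert dups == 0, f"{dups} duplicate rows in landed data"

The same pattern extends to data-type and value checks by comparing column-level profiles rather than plain counts.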
b) MapReduce Validation
The second step is validation of MapReduce. In this stage, the tester verifies the business logic on
every node and then validates it after running against multiple nodes, ensuring that:
The MapReduce process works correctly
Data aggregation or segregation rules are implemented on the data
Key-value pairs are generated
The data is valid after the MapReduce process completes
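To exercise the business logic without a cluster, the job's map and reduce functions can be run locally on a small, hand-checked sample and compared with an independently computed expectation. A minimal Python sketch follows; the sales-aggregation rule, field names, and expected totals are illustrative assumptions, not taken from any particular job:

    from collections import defaultdict

    # MapReduce logic validation sketch: run the job's map and reduce
    # functions on a small sample and compare the result with totals
    # computed by hand. The aggregation rule here is hypothetical.

    def mapper(record):
        # Emit (region, amount) key-value pairs.
        yield record["region"], record["amount"]

    def reducer(key, values):
        # Aggregate amounts per region.
        return key, sum(values)

    def run_local(records):
        # Simulate the shuffle: group emitted values by key, then reduce.
        grouped = defaultdict(list)
        for rec in records:
            for k, v in mapper(rec):
                grouped[k].append(v)
        return dict(reducer(k, vs) for k, vs in grouped.items())

    sample = [
        {"region": "EU", "amount": 10},
        {"region": "EU", "amount": 5},
        {"region": "US", "amount": 7},
    ]
    expected = {"EU": 15, "US": 7}  # computed by hand from the sample
    assert run_local(sample) == expected, "aggregation rule broken"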
c) Output Validation
The third and final stage of testing is output validation. The output data files are generated and ready to be
moved to an EDW (Enterprise Data Warehouse) or any other system, based on the analysis or analytics required.
Activities in the third stage include:
Checking that the transformation rules are correctly applied
Checking data integrity and a successful data load into the target system
Checking that there is no data corruption by comparing the target data with the HDFS file system data
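One way to implement the corruption check above is to compare order-independent checksums of the HDFS output and the loaded target table. Below is a minimal Python sketch, assuming the HDFS output is readable as a local delimited file; sqlite3 stands in for the EDW, and the file, table, and column names are hypothetical:

    import hashlib
    import sqlite3

    # Output validation sketch: confirm the load into the target system
    # did not corrupt data by comparing a normalized checksum of the HDFS
    # output file with one rebuilt from the target table.

    def file_checksum(path):
        # Hash the sorted lines so row order does not affect the digest.
        with open(path) as f:
            lines = sorted(line.rstrip("\n") for line in f)
        return hashlib.sha256("\n".join(lines).encode()).hexdigest()

    def table_checksum(conn, table, cols):
        # Rebuild the same line format from the target table and hash it.
        rows = conn.execute(f"SELECT {', '.join(cols)} FROM {table}").fetchall()
        lines = sorted(",".join(str(c) for c in row) for row in rows)
        return hashlib.sha256("\n".join(lines).encode()).hexdigest()

    conn = sqlite3.connect("edw.db")                      # hypothetical EDW
    hdfs_digest = file_checksum("output/part-00000.csv")  # hypothetical output
    edw_digest = table_checksum(conn, "fact_sales", ["region", "amount"])
    assert hdfs_digest == edw_digest, "target data differs from HDFS output"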
2) Architectural Testing
Architecture testing forms a crucial part of data lake testing, as a poor architecture will lead to poor performance.
Also, since data lake technologies are extremely resource-intensive and process large volumes of
data, architectural testing becomes essential. Along with this, since a lot of data movement is involved in the
process, performance testing assumes an even more important role in identifying:
Memory utilization
Job completion time
Data Throughput
Data Storage: How data is stored in different nodes
Commit logs: How large the commit log is allowed to grow
Concurrency: How many threads can perform write and read operations
Caching: Tuning of the cache settings "row cache" and "key cache"
Timeouts: Values for connection timeout, query timeout, etc.
JVM Parameters: Heap size, GC collection algorithms, etc.
MapReduce performance: Sorts, merges, etc.
Message queue: Message rate, size, etc.
To conduct performance testing, a structured approach needs to be strategized, since it involves huge volumes of
both structured and unstructured data. The teams involved need the proficiency to apply the defined
approach, as follows:
1. Setting up the application cluster to be tested.
2. Identifying and designing the corresponding workloads.
3. Preparing individual custom scripts to check:
   sub-component performance
   how each individual component performs in isolation
4. Executing the tests and analyzing the results:
   the rate at which the system consumes data from the different data sources
   the speed at which the MapReduce jobs or queries are executed
5. Re-configuring and re-testing components that did not perform optimally (a timing sketch follows this list).
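The sketch below illustrates steps 4 and 5: it runs a workload repeatedly, records job completion times, and derives throughput. It is a minimal Python sketch; the submitted command and the records-per-run figure are hypothetical placeholders for a real MapReduce job submission:

    import statistics
    import subprocess
    import time

    # Performance test sketch: execute a workload several times, record
    # job completion time, and derive throughput. The command and record
    # count below are hypothetical placeholders.

    WORKLOAD = ["echo", "stand-in for: hadoop jar job.jar ..."]  # hypothetical
    RECORDS_PER_RUN = 1_000_000                                  # hypothetical

    def run_once():
        # Time one end-to-end execution of the workload.
        start = time.perf_counter()
        subprocess.run(WORKLOAD, check=True, capture_output=True)
        return time.perf_counter() - start

    durations = [run_once() for _ in range(5)]
    throughput = [RECORDS_PER_RUN / d for d in durations]
    print(f"median completion time: {statistics.median(durations):.2f}s")
    print(f"median throughput: {statistics.median(throughput):,.0f} records/s")

Components whose median times regress between configurations are the candidates for the re-configuration step.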
3) Security Testing
Since data lakes hold entire enterprise data, security testing is required to verify the
authentication and authorization for different roles, as well as encryption of data at rest and in motion.
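A minimal authorization check can assert on the HTTP status each role receives from the lake's access layer, for example that a read-only role can read a dataset but is refused a write. The endpoint, tokens, and status codes in this Python sketch are hypothetical assumptions about such an access layer:

    import json
    import urllib.error
    import urllib.request

    # Authorization check sketch: a reader role should be able to read a
    # dataset but not write to it. All names below are hypothetical.

    BASE = "http://lake.example.com/api/datasets/customers"  # hypothetical

    def status_for(method, token, body=None):
        # Issue a request as the given role and return the HTTP status.
        req = urllib.request.Request(
            BASE, method=method, data=body,
            headers={"Authorization": f"Bearer {token}"},
        )
        try:
            with urllib.request.urlopen(req) as resp:
                return resp.status
        except urllib.error.HTTPError as err:
            return err.code

    reader_token = "reader-token"  # hypothetical credential
    assert status_for("GET", reader_token) == 200, "reader denied read access"
    payload = json.dumps({"id": 1}).encode()
    assert status_for("POST", reader_token, payload) == 403, \
        "reader was allowed to write"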
4) Visualization Testing
When a new report or dashboard is developed for consumption by other users, it is important to perform a few
checks to validate the data and design of the included reports. Key aspects of validation will include:
Design Check
Prompt Check
Data Accuracy Check
Drill Down Report Check
Browser Checks
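For the data accuracy check, one simple tactic is to recompute a figure the dashboard displays directly from the underlying table and compare the two values. In this Python sketch, sqlite3, the table, and the exported report value are hypothetical stand-ins for a real reporting stack:

    import sqlite3

    # Report data-accuracy sketch: recompute a dashboard figure from the
    # underlying table and compare it with the value the report shows.

    conn = sqlite3.connect("edw.db")  # hypothetical warehouse behind the report
    db_total = conn.execute(
        "SELECT SUM(amount) FROM fact_sales WHERE region = 'EU'"
    ).fetchone()[0]

    report_total = 15  # value read from the report's data export (hypothetical)
    assert db_total == report_total, (
        f"report shows {report_total}, source says {db_total}"
    )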