This document is confidential and contains proprietary information, including trade secrets of CitiusTech. Neither the document nor any of the information
contained in it may be reproduced or disclosed to any unauthorized person under any circumstances without the express written permission of CitiusTech.
Testing Strategies for Data Lake
Hosted on Hadoop
August 2019 | Authors: Vaibhav Shahane and Vaibhavi Indap
CitiusTech Thought Leadership
Agenda
▪ Overview
▪ Data Lake v/s Data Warehouse
▪ Structured Data
▪ Semi-structured Data
▪ References
Overview
▪ A data lake is a repository to store massive amounts of data in its native
form
▪ Businesses have disparate sources of data that are difficult to
analyze unless brought together on a single platform (a common pool
of data)
▪ A data lake allows business decision makers, data analysts and data
scientists to get a holistic view of data coming in from heterogeneous
sources
Data Lake v/s Data Warehouse
Similarities
▪ Data lakes maintain heterogeneous sources in a single pool
▪ They provide better access to enterprise-wide data for analysts and data scientists
Differences
▪ Data is highly organized and structured in data warehouses, whereas a data lake uses a flat structure and preserves the original schema
▪ Data present in data warehouses is transformed, aggregated and may lose its original schema
▪ Data warehouses provide transactional solutions by enabling analysts to drill down/up/through specific areas of the business
▪ Data lakes answer questions that aren't structured but need discovery using iterative algorithms and/or complex mathematical functions
Structured Data (1/9)
Testing Gates
▪ Data lake creation and testing can be organized around the following areas:
• Schema validations
• Data Masking validations
• Data Reconciliation at each load frequency
• ELT Framework (Extract, Load and Transform)
• On-premise vs. on-cloud validations (in case the data lake is hosted on cloud)
• Data Quality and Standardization validations
• Data partitioning and compaction
Structured Data (2/9)
Schema Validation
▪ Data from heterogeneous sources is present in the data lake, and the table schema of each
table, as defined in the source, must be preserved
▪ As a part of schema validation, QA team can cover the following pointers:
• Data type
• Data length
• Null/Not Null constraints
• Delimiters (pay attention to delimiters coming as part of data)
• Special characters (visible and invisible). For e.g., hex code 'c2 ad' is a soft hyphen that
appears as white space and does not go away when the TRIM function is applied during data
comparison
▪ Source metadata and metadata from data lake can be extracted from respective metastores and
compared
• If the source is on SQL Server, metadata can be retrieved using the sp_help stored procedure (e.g., from SSMS)
• If the data lake is on HDFS, Hive keeps its metadata in metastore tables such as DBS, TBLS,
COLUMNS_V2, SDS, etc.
▪ Visit https://utf8-chartable.de/unicode-utf8-table.pl for more characters
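The metadata comparison above can be sketched in Python; the schema dictionaries, the type-equivalence table and the helper function below are illustrative assumptions, not the actual output of sp_help or the Hive metastore:

```python
# Minimal sketch: compare source schema vs. data-lake schema, assuming both
# have already been extracted (e.g., via sp_help and Hive metastore queries).
source_schema = {
    "patient_id": {"type": "int", "length": 4, "nullable": False},
    "last_name":  {"type": "varchar", "length": 50, "nullable": True},
}
lake_schema = {
    "patient_id": {"type": "int", "length": 4, "nullable": False},
    "last_name":  {"type": "string", "length": 50, "nullable": True},  # Hive uses STRING
}

# Pairs of types treated as equivalent between the source RDBMS and Hive (assumption).
TYPE_EQUIV = {("varchar", "string"), ("datetime", "timestamp")}

def schema_mismatches(src, lake):
    """Return a list of (column, issue) pairs found while comparing schemas."""
    issues = []
    for col, meta in src.items():
        if col not in lake:
            issues.append((col, "missing in data lake"))
            continue
        lmeta = lake[col]
        if meta["type"] != lmeta["type"] and (meta["type"], lmeta["type"]) not in TYPE_EQUIV:
            issues.append((col, f"type {meta['type']} vs {lmeta['type']}"))
        if meta["nullable"] != lmeta["nullable"]:
            issues.append((col, "nullability differs"))
    return issues

print(schema_mismatches(source_schema, lake_schema))  # → [] (varchar/string equivalent)

# Invisible characters: a soft hyphen (U+00AD) survives strip(), so a naive
# trimmed comparison fails even though both values look identical.
assert "ABC\u00ad".strip() != "ABC"
```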
Structured Data (3/9)
Data Masking Validations
▪ Source systems might have PHI/PII data in unmasked form unless the client has anonymized it
beforehand
▪ Data masking logic implementation can be tested based on pre-agreed masking logic
▪ Masking logic can be written in SQL query / Excel formula and output compared with the data
masked by ETL code which has flowed into the data lake
▪ E.g. Unmasked SSN 123-45-7891 needs to be masked as XXX-XX-7891
▪ E.g. Unmasked email abc.def@xyz.com needs to be masked as axx.xxx@xyz.com
▪ Pay attention to unmasked data that does not arrive in the expected format, as this can cause
the masking logic to fail
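A minimal sketch of the masking checks above, assuming the SSN and email rules shown on this slide; the function names and regex are illustrative, not the actual ETL implementation:

```python
import re

def mask_ssn(ssn: str) -> str:
    """Mask an SSN in NNN-NN-NNNN form, keeping only the last four digits."""
    m = re.fullmatch(r"(\d{3})-(\d{2})-(\d{4})", ssn)
    if not m:
        # Unexpected formats should fail loudly rather than mask incorrectly
        raise ValueError(f"unexpected SSN format: {ssn!r}")
    return f"XXX-XX-{m.group(3)}"

def mask_email(email: str) -> str:
    """Keep the first character of the local part; mask the rest with 'x'."""
    local, domain = email.split("@", 1)
    masked = local[0] + "".join("x" if c.isalnum() else c for c in local[1:])
    return f"{masked}@{domain}"

print(mask_ssn("123-45-7891"))        # → XXX-XX-7891
print(mask_email("abc.def@xyz.com"))  # → axx.xxx@xyz.com
```

The expected output of the masking SQL/Excel logic can then be compared row by row with the data masked by the ETL code in the lake.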
Structured Data (4/9)
Data Reconciliation at Each Load Frequency
▪ Data Reconciliation is a testing gate wherein the data which is loaded in target is compared
against data in the source to ensure that no data is dropped/corrupted in the migration process
▪ Record count for each table between source and data lake:
• Initial load
• Incremental load
▪ Truncate load v/s. Append load:
• Truncate load: Data in target table is truncated every time a new data feed is about to be
loaded
• Append load: A new data feed is appended to already existing data in target table
▪ Data validation for each table between source and data lake: data in each row and column of the
source table is compared with the data lake
• When using MS Excel, data needs to be batched to get a manageable amount of data
• To compare the entire dataset, a custom-built automation tool can be used
▪ Duplicates in data lake: SQL queries can be used to identify whether any duplicates were
introduced during the data lake load
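The count, duplicate and row-level checks above can be sketched as follows, assuming rows from both sides have already been extracted as lists of tuples (the sample data is illustrative):

```python
from collections import Counter

# Illustrative extracts: same three source rows, with one duplicate introduced
# into the lake by the load process.
source_rows = [(1, "A"), (2, "B"), (3, "C")]
lake_rows   = [(1, "A"), (2, "B"), (3, "C"), (3, "C")]

def reconcile(src, lake):
    """Compare record counts, find duplicates in the lake, and find dropped rows."""
    return {
        "source_count": len(src),
        "lake_count": len(lake),
        "counts_match": len(src) == len(lake),
        # Rows loaded more than once into the lake
        "duplicates": [row for row, n in Counter(lake).items() if n > 1],
        # Rows present in source but missing from the lake
        "missing": [row for row in src if row not in set(lake)],
    }

print(reconcile(source_rows, lake_rows))
# → counts_match False, duplicates [(3, 'C')], missing []
```

The same check can be run after both initial and incremental loads, with the lake-side extract scoped to the batch being reconciled.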
Structured Data (5/9)
ELT Framework Validation
▪ Logging of the source type informs whether the source data comes from a table or a flat file
▪ Logging of source connection string (DB link, file path, etc.):
• Indicates the database connection string if source data is coming from database table
• If source is a flat file, this informs the landing location of the file and where to read the data
from
▪ Generation of batch IDs on a fresh run and on rerun upon failure helps identify the data loaded
in a batch every day
▪ Flags such as primary key flag, truncate load flag, critical table flag, upload to cloud flag, etc. help
define the behavior of ELT jobs
▪ Logging of records processed, loaded and rejected for each table shows the number of records
extracted from source and rejected/loaded into the target data lake
▪ Polling frequency, trigger check, email notification etc. indicate the frequency to poll for
incoming file/data, trigger next batch, send notifications of batch status, etc.
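A hypothetical shape for the ELT audit log described above, with the sanity check QA can apply to it; all field names and values here are assumptions, not the actual framework's log schema:

```python
# One audit-log entry per table per batch (illustrative field names).
batch_log = {
    "batch_id": 20190801,
    "source_type": "flat_file",               # table vs. flat file
    "source_connection": "/landing/claims/2019-08-01.txt",
    "truncate_load": False,                   # behavior flags
    "records_processed": 1000,
    "records_loaded": 990,
    "records_rejected": 10,
}

# Processed records must equal loaded + rejected; otherwise rows were
# silently dropped somewhere in the ELT pipeline.
assert batch_log["records_processed"] == (
    batch_log["records_loaded"] + batch_log["records_rejected"]
)
print("audit counts reconcile")
```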
Structured Data (6/9)
On-Premise Vs. On-Cloud Validation
▪ Additional testing is required when sources are hosted on-premise and data lake is being created
on cloud
▪ Validate the data types supported by the on-cloud application (through which data analysts/scientists will
be querying data)
• For e.g., Azure SQL Data Warehouse doesn't support the timestamp/text data types. One needs to
cast source data to datetime2/varchar respectively
• For e.g., Impala does not support DOUBLE data type which has to be converted to NUMERIC
▪ Validate user group access (who can see what type of data)
▪ Validate masked/unmasked views based on type of users
▪ Validate attribute names with spaces/special characters/reserved keywords between on-premise
and on-cloud
• For e.g., source attributes named LOCATION and DIV are reserved keywords in Impala; hence, they
must be changed to LOC and DIVISION to preserve the meaning
▪ Validate external tables created on HDFS files published through ELT jobs
• For e.g., Validate whether the external table is pointing to correct location on HDFS where
the files are being published by ETL jobs
▪ Validate the data consistency between on-premise and on-cloud
• For e.g., Use custom built validation tools to compare each attribute
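The data-type and reserved-keyword validations above can be sketched as a mapping check; both lookup tables below are assumptions based only on the examples on this slide:

```python
# Expected on-premise → on-cloud conversions (illustrative, from the
# Azure SQL DW and Impala examples above).
TYPE_MAP = {"timestamp": "datetime2", "text": "varchar"}
RESERVED_RENAMES = {"LOCATION": "LOC", "DIV": "DIVISION"}

def expected_cloud_column(name, dtype):
    """Return the (name, type) a source column should carry on the cloud side."""
    return RESERVED_RENAMES.get(name, name), TYPE_MAP.get(dtype, dtype)

print(expected_cloud_column("LOCATION", "timestamp"))  # → ('LOC', 'datetime2')
print(expected_cloud_column("MRN", "int"))             # → ('MRN', 'int')
```

QA can generate the expected cloud-side schema this way and diff it against what the ELT jobs actually created.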
Structured Data (7/9)
Data Quality and Standardization Validation
▪ Data in data lake needs to be cleansed and standardized for better analysis
▪ Actual data isn't removed or updated; instead, flags (Warning / Invalid) highlight data quality issues
▪ Validate various DQ/DS rules by running SQL queries on source data and comparing the output with the
DQ/DS tool's output
▪ For e.g., Null Check DQ rule on MRN (Medical Record Number): When data is processed through
a tool for DQ, all the records with NULL MRN are flagged as WARNING/INVALID along with a
remark column that states MRN is NULL
▪ For e.g., Missing Parent DQ rule on Encounter ID with respect to Patient: When an encounter
doesn't have an associated patient in the patient table, the Encounter ID is flagged as WARNING/INVALID
along with a remark column that states PATIENT MISSING
▪ For e.g., Race Data Standardization:
• Race data in source with codes such as 1001, 1002 needs to be standardized with the
corresponding descriptions, such as Hispanic, Asian, etc.
• Based on requirements, standardization can be achieved on reference table or transaction
(data) tables as well
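A minimal sketch of the Null Check DQ rule described above; the record layout and the flag/remark values are illustrative assumptions mirroring the slide's example:

```python
# Sample records; data itself is never removed or updated by the DQ step.
records = [
    {"mrn": "12345", "encounter_id": "E1"},
    {"mrn": None,    "encounter_id": "E2"},
]

def apply_null_check(rows, field, remark):
    """Flag rows where `field` is NULL, adding flag and remark columns only."""
    for row in rows:
        if row[field] is None:
            row["dq_flag"] = "INVALID"
            row["dq_remark"] = remark
    return rows

flagged = apply_null_check(records, "mrn", "MRN is NULL")
print([r.get("dq_flag") for r in flagged])  # → [None, 'INVALID']
```

The equivalent SQL query on the source can then be compared against the DQ tool's flagged output.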
Structured Data (8/9)
Data Partitioning and Compaction
▪ When a data lake is created on cloud using the Hadoop file system (HDFS), it is preferable to
store data in partitions (based on user-provided partition criteria) and use compaction in order
to minimize multiple read operations on the underlying HDFS
▪ Validate whether the data in source is getting partitioned appropriately while being stored in a
data lake on HDFS. For e.g., Partitioning based on Encounter_Create_Date:
• This will create a folder structure in the output by year and will contain encounters in the
file partitioned by year
• Data retrieval will be faster when analysts/scientists query on specified date range since
data is already stored in such partitions
▪ Compaction includes two areas:
• Merging of multiple smaller file chunks into a predefined file size to avoid multiple read
operations
• Converting text format into Parquet/ZIP format to achieve file size reduction
Structured Data (9/9)
Data Partitioning and Compaction
▪ Validate the size of the files published on HDFS by ELT jobs by logging into Impala
• For e.g., ELT jobs produced .txt files on cloud with a total size of 3.19 GB, which
reduced to 516 MB after Parquet conversion
▪ Validate merging of multiple smaller files into one or more large files
• For e.g., the DQ-DS tool might produce multiple small files (based on storage availability
on the underlying data nodes) at the output location. A utility can be written
to merge all these files into a single file in Parquet format
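The small-file merging idea above can be illustrated with a toy compaction planner; the chunk sizes and target threshold are illustrative, not actual HDFS block settings:

```python
# Greedily group small file chunks (sizes in MB) so each merged file
# stays close to a predefined target size.
def plan_compaction(chunk_sizes_mb, target_mb=128):
    merged, current, current_size = [], [], 0
    for size in chunk_sizes_mb:
        if current and current_size + size > target_mb:
            merged.append(current)      # close out the current merged file
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        merged.append(current)
    return merged

print(plan_compaction([40, 50, 60, 30, 90, 10]))  # → [[40, 50], [60, 30], [90, 10]]
```

Validation then checks that the merged files on HDFS match the expected grouping and sizes.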
Structured Data – Challenges
Test Strategy
▪ ELT jobs are unable to manage continuously varying data with special characters
▪ Delimiters arriving as part of incoming data
▪ PHI data in a format different from test data may cause masking logic to fail
▪ Backdated inserts not captured in incremental runs
Tools and Technology
▪ Limitations on data types, reserved keywords and special characters handled by cloud applications
▪ Date format conversion based on the time zone selected during installation
▪ Data retrieval challenges and the cost involved in downloading data for analysis
Configuration / Environment
▪ Late-arriving flat files
▪ Any service breakdown in production causes parts of the end-to-end workflow to break
Others
▪ Project timelines need to accommodate any unknowns in the data during production deployment and/or a few weeks after deployment
Semi-structured Data (1/6)
Flow Diagram Of Semi-Structured (JSON) Message Ingestion into Data lake
Semi-structured Data (2/6)
Testing Gates
▪ Data lake creation and testing for semi-structured data can take place in the following areas:
• JSON Message Validation
• Data Reconciliation
• ELT Framework (Extract, Load and Transform)
• Data Quality and Standardization Validations
Semi-structured Data (3/6)
JSON Message Validation
▪ The data lake can be integrated with the Kafka messaging system, which produces JSON messages in
semi-structured format
▪ As a part of JSON message validation, QA team can cover the following pointers:
• Compare the JSON Message with the JSON schema provided as part of requirement
• Data Type Check
• Null / Not Null constraints Check
▪ For instance:
JSON Schema:
{
  "ServiceLevel": { "type": ["string", "null"] },
  "ServiceType": { "type": ["string"] }
}
JSON Message:
{
  "ServiceLevel": "One",
  "ServiceType": "Skilled Nurse"
}
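The schema checks above can be sketched with the standard library (a real project might instead use a package such as jsonschema); the validator below is an illustrative assumption:

```python
import json

# Schema from the slide: ServiceLevel is nullable, ServiceType is not.
schema = {
    "ServiceLevel": {"type": ["string", "null"]},
    "ServiceType":  {"type": ["string"]},
}
TYPE_CHECKS = {"string": lambda v: isinstance(v, str), "null": lambda v: v is None}

def validate(message, schema):
    """Return a list of validation errors (empty list means the message passes)."""
    errors = []
    for field, rule in schema.items():
        if field not in message:
            errors.append(f"{field}: missing")
            continue
        if not any(TYPE_CHECKS[t](message[field]) for t in rule["type"]):
            errors.append(f"{field}: expected {rule['type']}")
    return errors

msg = json.loads('{"ServiceLevel": "One", "ServiceType": "Skilled Nurse"}')
print(validate(msg, schema))                      # → []
print(validate({"ServiceLevel": None}, schema))   # → ['ServiceType: missing']
```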
Semi-structured Data (4/6)
Data Reconciliation
▪ Data Reconciliation is a testing gate wherein the data loaded in target tables is compared against
the data in the source JSON messages to ensure no data is trimmed/corrupted/missed in the
migration process. One can use the following strategies to ensure the same:
▪ Record count for each table between source and data lake
• Simple JSON: Single JSON message ingested in lake is loaded as single row in target table
• Complex JSON: More than one row is loaded in the target table depending on the level of
hierarchy and nesting present in the JSON message
▪ Data validation for each table between source and data lake. Data in each row and column in
JSON message to be compared with data lake
• Use the OPENJSON function in SQL Server to parse the JSON messages and convert them
into structured format
• Compare the parsed output of OPENJSON with the data loaded in target tables using
Python
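The complex-JSON count rule above can be sketched as follows; the message shape and target row layout are illustrative assumptions:

```python
# A nested (complex) JSON message expands to one target row per child element.
message = {
    "encounter_id": "E1",
    "diagnoses": [{"code": "I10"}, {"code": "E11"}],  # nested array → multiple rows
}

def flatten(msg):
    """Explode one nested JSON message into the rows expected in the target table."""
    return [
        {"encounter_id": msg["encounter_id"], "diagnosis_code": d["code"]}
        for d in msg["diagnoses"]
    ]

rows = flatten(message)
# Expected target row count equals the number of nested elements.
assert len(rows) == len(message["diagnoses"])
print(rows)
```

A simple JSON with no nesting would flatten to exactly one row, matching the single-row rule above.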
Semi-structured Data (5/6)
ELT Framework Validation
▪ Raw layer validation on HDFS: Indicates whether the source JSON messages ingested through
Kafka are loaded into HDFS in raw form, before the processing tool picks up the messages and loads
them into the target tables
▪ Logging of Kafka details: Informs about Kafka topic, partition, offset, and hostname
▪ Generation of data lake UIDs helps in identifying JSON messages
▪ Logging of records processed, loaded and rejected as part of each JSON ingestion shows the
number of records ingested through Kafka, processed, and failed in JSON schema validation, with
error logging
▪ Email notification: Shows JSONs ingested through Kafka on an hourly basis, the JSON count loaded
in the raw layer, and a daily report with JSON-wise counts, failures and successful ingestions
Semi-structured Data (6/6)
Data Quality and Standardization Validation
▪ The data in data lake needs to be cleansed and standardized for better analysis
▪ In this case, the actual data need not be removed or updated but both valid and erroneous data
are logged into the audit tables
▪ Validate JSON messages against JSON schema
• For e.g., Null Check: If a non-nullable attribute is assigned a null value in the input JSON
message, ingestion of that message fails and the error is logged in the audit table with the
non-nullable attribute's details
Semi-structured Data – Challenges
Test Strategy
▪ ELT jobs are unable to manage continuously varying data with special characters
▪ Test data preparation as per test scenarios
▪ Manual validation of a single JSON loaded into multiple tables
▪ Live reconciliation of messages produced through Kafka due to continuous streaming
Tools and Technology
▪ Limitations on reserved keywords and special character handling
Configuration / Environment
▪ Multiple services running simultaneously on the cluster result in choking of JSON messages in the QA environment
▪ Any service breakdown in production causes parts of the end-to-end workflow to break
Others
▪ Project timelines need to accommodate any unknowns in the data during production deployment and/or a few weeks after deployment
References
▪ https://www.cloudmoyo.com/blog/difference-between-a-data-warehouse-and-a-data-lake/
▪ https://utf8-chartable.de/unicode-utf8-table.pl
▪ https://www.cigniti.com/blog/5-big-data-testing-challenges/
▪ https://www.confluent.io/blog/the-value-of-apache-kafka-in-big-data-ecosystem/
▪ https://kafka.apache.org/
▪ https://dzone.com/articles/json-drivers-parsing-hierarchical-data
About CitiusTech
▪ 3,500+ Healthcare IT professionals worldwide
▪ 1,500+ Healthcare software engineering
▪ 800+ HL7 certified professionals
▪ 30%+ CAGR over last 5 years
▪ 110+ Healthcare customers
▪ Healthcare technology companies
▪ Hospitals, IDNs & medical groups
▪ Payers and health plans
▪ ACO, MCO, HIE, HIX, NHIN and RHIO
▪ Pharma & Life Sciences companies
Thank You
Authors:
Vaibhav Shahane
Vaibhavi Indap
Technical Lead
thoughtleaders@citiustech.com
6 Epilepsy Use Cases for NLP
CitiusTech
 
Opioid Epidemic - Causes, Impact and Future
Opioid Epidemic - Causes, Impact and FutureOpioid Epidemic - Causes, Impact and Future
Opioid Epidemic - Causes, Impact and Future
CitiusTech
 
Rising Importance of Health Economics & Outcomes Research
Rising Importance of Health Economics & Outcomes ResearchRising Importance of Health Economics & Outcomes Research
Rising Importance of Health Economics & Outcomes Research
CitiusTech
 
ICD 11: Impact on Payer Market
ICD 11: Impact on Payer MarketICD 11: Impact on Payer Market
ICD 11: Impact on Payer Market
CitiusTech
 
Driving Home Health Efficiency through Data Analytics
Driving Home Health Efficiency through Data AnalyticsDriving Home Health Efficiency through Data Analytics
Driving Home Health Efficiency through Data Analytics
CitiusTech
 
Poster Presentation - FDA Compliance Landscape & What it Means to Your AI Asp...
Poster Presentation - FDA Compliance Landscape & What it Means to Your AI Asp...Poster Presentation - FDA Compliance Landscape & What it Means to Your AI Asp...
Poster Presentation - FDA Compliance Landscape & What it Means to Your AI Asp...
CitiusTech
 

More from CitiusTech (20)

Member Engagement Using Sentiment Analysis for Health Plans
Member Engagement Using Sentiment Analysis for Health PlansMember Engagement Using Sentiment Analysis for Health Plans
Member Engagement Using Sentiment Analysis for Health Plans
 
Evolving Role of Digital Biomarkers in Healthcare
Evolving Role of Digital Biomarkers in HealthcareEvolving Role of Digital Biomarkers in Healthcare
Evolving Role of Digital Biomarkers in Healthcare
 
Virtual Care: Key Challenges & Opportunities for Payer Organizations
Virtual Care: Key Challenges & Opportunities for Payer Organizations Virtual Care: Key Challenges & Opportunities for Payer Organizations
Virtual Care: Key Challenges & Opportunities for Payer Organizations
 
Provider-led Health Plans (Payviders)
Provider-led Health Plans (Payviders)Provider-led Health Plans (Payviders)
Provider-led Health Plans (Payviders)
 
CMS Medicare Advantage 2021 Star Ratings: An Analysis
CMS Medicare Advantage 2021 Star Ratings: An AnalysisCMS Medicare Advantage 2021 Star Ratings: An Analysis
CMS Medicare Advantage 2021 Star Ratings: An Analysis
 
Accelerate Healthcare Technology Modernization with Containerization and DevOps
Accelerate Healthcare Technology Modernization with Containerization and DevOpsAccelerate Healthcare Technology Modernization with Containerization and DevOps
Accelerate Healthcare Technology Modernization with Containerization and DevOps
 
FHIR for Life Sciences
FHIR for Life SciencesFHIR for Life Sciences
FHIR for Life Sciences
 
Leveraging Analytics to Identify High Risk Patients
Leveraging Analytics to Identify High Risk PatientsLeveraging Analytics to Identify High Risk Patients
Leveraging Analytics to Identify High Risk Patients
 
FHIR Adoption Framework for Payers
FHIR Adoption Framework for PayersFHIR Adoption Framework for Payers
FHIR Adoption Framework for Payers
 
Payer-Provider Engagement
Payer-Provider Engagement Payer-Provider Engagement
Payer-Provider Engagement
 
COVID19: Impact & Mitigation Strategies for Payer Quality Improvement 2021
COVID19: Impact & Mitigation Strategies for Payer Quality Improvement 2021COVID19: Impact & Mitigation Strategies for Payer Quality Improvement 2021
COVID19: Impact & Mitigation Strategies for Payer Quality Improvement 2021
 
Demystifying Robotic Process Automation (RPA) & Automation Testing
Demystifying Robotic Process Automation (RPA) & Automation TestingDemystifying Robotic Process Automation (RPA) & Automation Testing
Demystifying Robotic Process Automation (RPA) & Automation Testing
 
Progressive Web Apps in Healthcare
Progressive Web Apps in HealthcareProgressive Web Apps in Healthcare
Progressive Web Apps in Healthcare
 
RPA in Healthcare
RPA in HealthcareRPA in Healthcare
RPA in Healthcare
 
6 Epilepsy Use Cases for NLP
6 Epilepsy Use Cases for NLP6 Epilepsy Use Cases for NLP
6 Epilepsy Use Cases for NLP
 
Opioid Epidemic - Causes, Impact and Future
Opioid Epidemic - Causes, Impact and FutureOpioid Epidemic - Causes, Impact and Future
Opioid Epidemic - Causes, Impact and Future
 
Rising Importance of Health Economics & Outcomes Research
Rising Importance of Health Economics & Outcomes ResearchRising Importance of Health Economics & Outcomes Research
Rising Importance of Health Economics & Outcomes Research
 
ICD 11: Impact on Payer Market
ICD 11: Impact on Payer MarketICD 11: Impact on Payer Market
ICD 11: Impact on Payer Market
 
Driving Home Health Efficiency through Data Analytics
Driving Home Health Efficiency through Data AnalyticsDriving Home Health Efficiency through Data Analytics
Driving Home Health Efficiency through Data Analytics
 
Poster Presentation - FDA Compliance Landscape & What it Means to Your AI Asp...
Poster Presentation - FDA Compliance Landscape & What it Means to Your AI Asp...Poster Presentation - FDA Compliance Landscape & What it Means to Your AI Asp...
Poster Presentation - FDA Compliance Landscape & What it Means to Your AI Asp...
 

Recently uploaded

Christine's Product Research Presentation.pptx
Christine's Product Research Presentation.pptxChristine's Product Research Presentation.pptx
Christine's Product Research Presentation.pptx
christinelarrosa
 
Y-Combinator seed pitch deck template PP
Y-Combinator seed pitch deck template PPY-Combinator seed pitch deck template PP
Y-Combinator seed pitch deck template PP
c5vrf27qcz
 
LF Energy Webinar: Carbon Data Specifications: Mechanisms to Improve Data Acc...
LF Energy Webinar: Carbon Data Specifications: Mechanisms to Improve Data Acc...LF Energy Webinar: Carbon Data Specifications: Mechanisms to Improve Data Acc...
LF Energy Webinar: Carbon Data Specifications: Mechanisms to Improve Data Acc...
DanBrown980551
 
Christine's Supplier Sourcing Presentaion.pptx
Christine's Supplier Sourcing Presentaion.pptxChristine's Supplier Sourcing Presentaion.pptx
Christine's Supplier Sourcing Presentaion.pptx
christinelarrosa
 
GlobalLogic Java Community Webinar #18 “How to Improve Web Application Perfor...
GlobalLogic Java Community Webinar #18 “How to Improve Web Application Perfor...GlobalLogic Java Community Webinar #18 “How to Improve Web Application Perfor...
GlobalLogic Java Community Webinar #18 “How to Improve Web Application Perfor...
GlobalLogic Ukraine
 
Leveraging the Graph for Clinical Trials and Standards
Leveraging the Graph for Clinical Trials and StandardsLeveraging the Graph for Clinical Trials and Standards
Leveraging the Graph for Clinical Trials and Standards
Neo4j
 
Poznań ACE event - 19.06.2024 Team 24 Wrapup slidedeck
Poznań ACE event - 19.06.2024 Team 24 Wrapup slidedeckPoznań ACE event - 19.06.2024 Team 24 Wrapup slidedeck
Poznań ACE event - 19.06.2024 Team 24 Wrapup slidedeck
FilipTomaszewski5
 
AppSec PNW: Android and iOS Application Security with MobSF
AppSec PNW: Android and iOS Application Security with MobSFAppSec PNW: Android and iOS Application Security with MobSF
AppSec PNW: Android and iOS Application Security with MobSF
Ajin Abraham
 
"Frontline Battles with DDoS: Best practices and Lessons Learned", Igor Ivaniuk
"Frontline Battles with DDoS: Best practices and Lessons Learned",  Igor Ivaniuk"Frontline Battles with DDoS: Best practices and Lessons Learned",  Igor Ivaniuk
"Frontline Battles with DDoS: Best practices and Lessons Learned", Igor Ivaniuk
Fwdays
 
Northern Engraving | Modern Metal Trim, Nameplates and Appliance Panels
Northern Engraving | Modern Metal Trim, Nameplates and Appliance PanelsNorthern Engraving | Modern Metal Trim, Nameplates and Appliance Panels
Northern Engraving | Modern Metal Trim, Nameplates and Appliance Panels
Northern Engraving
 
Mutation Testing for Task-Oriented Chatbots
Mutation Testing for Task-Oriented ChatbotsMutation Testing for Task-Oriented Chatbots
Mutation Testing for Task-Oriented Chatbots
Pablo Gómez Abajo
 
inQuba Webinar Mastering Customer Journey Management with Dr Graham Hill
inQuba Webinar Mastering Customer Journey Management with Dr Graham HillinQuba Webinar Mastering Customer Journey Management with Dr Graham Hill
inQuba Webinar Mastering Customer Journey Management with Dr Graham Hill
LizaNolte
 
Demystifying Knowledge Management through Storytelling
Demystifying Knowledge Management through StorytellingDemystifying Knowledge Management through Storytelling
Demystifying Knowledge Management through Storytelling
Enterprise Knowledge
 
ScyllaDB Tablets: Rethinking Replication
ScyllaDB Tablets: Rethinking ReplicationScyllaDB Tablets: Rethinking Replication
ScyllaDB Tablets: Rethinking Replication
ScyllaDB
 
QR Secure: A Hybrid Approach Using Machine Learning and Security Validation F...
QR Secure: A Hybrid Approach Using Machine Learning and Security Validation F...QR Secure: A Hybrid Approach Using Machine Learning and Security Validation F...
QR Secure: A Hybrid Approach Using Machine Learning and Security Validation F...
AlexanderRichford
 
Must Know Postgres Extension for DBA and Developer during Migration
Must Know Postgres Extension for DBA and Developer during MigrationMust Know Postgres Extension for DBA and Developer during Migration
Must Know Postgres Extension for DBA and Developer during Migration
Mydbops
 
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectorsConnector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors
DianaGray10
 
The Microsoft 365 Migration Tutorial For Beginner.pptx
The Microsoft 365 Migration Tutorial For Beginner.pptxThe Microsoft 365 Migration Tutorial For Beginner.pptx
The Microsoft 365 Migration Tutorial For Beginner.pptx
operationspcvita
 
Biomedical Knowledge Graphs for Data Scientists and Bioinformaticians
Biomedical Knowledge Graphs for Data Scientists and BioinformaticiansBiomedical Knowledge Graphs for Data Scientists and Bioinformaticians
Biomedical Knowledge Graphs for Data Scientists and Bioinformaticians
Neo4j
 
Essentials of Automations: Exploring Attributes & Automation Parameters
Essentials of Automations: Exploring Attributes & Automation ParametersEssentials of Automations: Exploring Attributes & Automation Parameters
Essentials of Automations: Exploring Attributes & Automation Parameters
Safe Software
 

Recently uploaded (20)

Christine's Product Research Presentation.pptx
Christine's Product Research Presentation.pptxChristine's Product Research Presentation.pptx
Christine's Product Research Presentation.pptx
 
Y-Combinator seed pitch deck template PP
Y-Combinator seed pitch deck template PPY-Combinator seed pitch deck template PP
Y-Combinator seed pitch deck template PP
 
LF Energy Webinar: Carbon Data Specifications: Mechanisms to Improve Data Acc...
LF Energy Webinar: Carbon Data Specifications: Mechanisms to Improve Data Acc...LF Energy Webinar: Carbon Data Specifications: Mechanisms to Improve Data Acc...
LF Energy Webinar: Carbon Data Specifications: Mechanisms to Improve Data Acc...
 
Christine's Supplier Sourcing Presentaion.pptx
Christine's Supplier Sourcing Presentaion.pptxChristine's Supplier Sourcing Presentaion.pptx
Christine's Supplier Sourcing Presentaion.pptx
 
GlobalLogic Java Community Webinar #18 “How to Improve Web Application Perfor...
GlobalLogic Java Community Webinar #18 “How to Improve Web Application Perfor...GlobalLogic Java Community Webinar #18 “How to Improve Web Application Perfor...
GlobalLogic Java Community Webinar #18 “How to Improve Web Application Perfor...
 
Leveraging the Graph for Clinical Trials and Standards
Leveraging the Graph for Clinical Trials and StandardsLeveraging the Graph for Clinical Trials and Standards
Leveraging the Graph for Clinical Trials and Standards
 
Poznań ACE event - 19.06.2024 Team 24 Wrapup slidedeck
Poznań ACE event - 19.06.2024 Team 24 Wrapup slidedeckPoznań ACE event - 19.06.2024 Team 24 Wrapup slidedeck
Poznań ACE event - 19.06.2024 Team 24 Wrapup slidedeck
 
AppSec PNW: Android and iOS Application Security with MobSF
AppSec PNW: Android and iOS Application Security with MobSFAppSec PNW: Android and iOS Application Security with MobSF
AppSec PNW: Android and iOS Application Security with MobSF
 
"Frontline Battles with DDoS: Best practices and Lessons Learned", Igor Ivaniuk
"Frontline Battles with DDoS: Best practices and Lessons Learned",  Igor Ivaniuk"Frontline Battles with DDoS: Best practices and Lessons Learned",  Igor Ivaniuk
"Frontline Battles with DDoS: Best practices and Lessons Learned", Igor Ivaniuk
 
Northern Engraving | Modern Metal Trim, Nameplates and Appliance Panels
Northern Engraving | Modern Metal Trim, Nameplates and Appliance PanelsNorthern Engraving | Modern Metal Trim, Nameplates and Appliance Panels
Northern Engraving | Modern Metal Trim, Nameplates and Appliance Panels
 
Mutation Testing for Task-Oriented Chatbots
Mutation Testing for Task-Oriented ChatbotsMutation Testing for Task-Oriented Chatbots
Mutation Testing for Task-Oriented Chatbots
 
inQuba Webinar Mastering Customer Journey Management with Dr Graham Hill
inQuba Webinar Mastering Customer Journey Management with Dr Graham HillinQuba Webinar Mastering Customer Journey Management with Dr Graham Hill
inQuba Webinar Mastering Customer Journey Management with Dr Graham Hill
 
Demystifying Knowledge Management through Storytelling
Demystifying Knowledge Management through StorytellingDemystifying Knowledge Management through Storytelling
Demystifying Knowledge Management through Storytelling
 
ScyllaDB Tablets: Rethinking Replication
ScyllaDB Tablets: Rethinking ReplicationScyllaDB Tablets: Rethinking Replication
ScyllaDB Tablets: Rethinking Replication
 
QR Secure: A Hybrid Approach Using Machine Learning and Security Validation F...
QR Secure: A Hybrid Approach Using Machine Learning and Security Validation F...QR Secure: A Hybrid Approach Using Machine Learning and Security Validation F...
QR Secure: A Hybrid Approach Using Machine Learning and Security Validation F...
 
Must Know Postgres Extension for DBA and Developer during Migration
Must Know Postgres Extension for DBA and Developer during MigrationMust Know Postgres Extension for DBA and Developer during Migration
Must Know Postgres Extension for DBA and Developer during Migration
 
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectorsConnector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors
 
The Microsoft 365 Migration Tutorial For Beginner.pptx
The Microsoft 365 Migration Tutorial For Beginner.pptxThe Microsoft 365 Migration Tutorial For Beginner.pptx
The Microsoft 365 Migration Tutorial For Beginner.pptx
 
Biomedical Knowledge Graphs for Data Scientists and Bioinformaticians
Biomedical Knowledge Graphs for Data Scientists and BioinformaticiansBiomedical Knowledge Graphs for Data Scientists and Bioinformaticians
Biomedical Knowledge Graphs for Data Scientists and Bioinformaticians
 
Essentials of Automations: Exploring Attributes & Automation Parameters
Essentials of Automations: Exploring Attributes & Automation ParametersEssentials of Automations: Exploring Attributes & Automation Parameters
Essentials of Automations: Exploring Attributes & Automation Parameters
 

Testing Strategies for Data Lake Hosted on Hadoop
5
Structured Data (1/9)
Testing Gates
▪ Data lake creation and testing can take place around the following areas:
• Schema validations
• Data masking validations
• Data reconciliation at each load frequency
• ELT framework (Extract, Load and Transform)
• On-premise vs. on-cloud validations (when the data lake is hosted on cloud)
• Data quality and standardization validations
• Data partitioning and compaction
6
Structured Data (2/9)
Schema Validation
▪ Data from heterogeneous sources lands in the data lake, and the table schema of every table is preserved from the source
▪ As part of schema validation, the QA team can cover the following:
• Data type
• Data length
• Null/Not Null constraints
• Delimiters (pay attention to delimiters arriving as part of the data)
• Special characters, visible and invisible. For e.g., hex code 'c2 ad' is a soft hyphen that appears as white space and does not go away when the TRIM function is applied during data comparison
▪ Source metadata and data lake metadata can be extracted from their respective metastores and compared
• If the source is on SQL Server (SSMS), metadata can be retrieved using the sp_help stored procedure
• If the data lake is on HDFS, Hive keeps its own metadata tables such as DBS, TBLS, COLUMNS_V2 and SDS
▪ See https://utf8-chartable.de/unicode-utf8-table.pl for more characters
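The soft-hyphen pitfall above can be checked programmatically. This is a minimal sketch (the character list and function name are illustrative, not from the deck) that flags invisible characters which look like spaces yet survive a TRIM-style cleanup:

```python
# Hypothetical helper: flag invisible characters (e.g. the soft hyphen,
# UTF-8 bytes 'c2 ad') that render as white space but are NOT stripped
# by TRIM/strip() during column-level comparison. List is illustrative.
INVISIBLE_CHARS = {
    "\u00ad": "SOFT HYPHEN",
    "\u200b": "ZERO WIDTH SPACE",
    "\u00a0": "NO-BREAK SPACE",
    "\ufeff": "BYTE ORDER MARK",
}

def find_invisible_chars(value: str):
    """Return (position, name) pairs for every hidden character in value."""
    return [(i, INVISIBLE_CHARS[ch]) for i, ch in enumerate(value)
            if ch in INVISIBLE_CHARS]

# 'Smith' followed by a soft hyphen: strip() leaves it untouched
dirty = "Smith\u00ad"
assert dirty.strip() == dirty            # TRIM-style cleanup does not help
print(find_invisible_chars(dirty))       # [(5, 'SOFT HYPHEN')]
```

Running such a scan on both source and lake extracts before comparison avoids false mismatches caused by characters the eye cannot see.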
7
Structured Data (3/9)
Data Masking Validations
▪ Source systems may contain PHI/PII data in unmasked form unless the client has anonymized it beforehand
▪ The implementation of masking logic can be tested against the pre-agreed masking rules
▪ Masking logic can be rewritten as a SQL query or Excel formula and its output compared with the data masked by the ETL code that has flowed into the data lake
• For e.g., unmasked SSN 123-456-7891 needs to be masked as XXX-XX-7891
• For e.g., unmasked email abc.def@xyz.com needs to be masked as axx.xxx@xyz.com
▪ Pay attention to unmasked data arriving in an unexpected format, which can cause the masking logic to fail
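The two masking rules above can be expressed as small reference functions and compared against the ETL output. A sketch, assuming the exact formats shown on the slide (the function names and the ValueError path for malformed input are mine):

```python
import re

# Hypothetical expected-masking helpers mirroring the slide's rules:
# SSN 123-456-7891 -> XXX-XX-7891, email abc.def@xyz.com -> axx.xxx@xyz.com
def mask_ssn(ssn: str) -> str:
    m = re.fullmatch(r"(\d{3})-(\d{3})-(\d{4})", ssn)
    if not m:
        # unexpected format: surface it, as the slide warns masking may fail
        raise ValueError(f"unexpected SSN format: {ssn!r}")
    return f"XXX-XX-{m.group(3)}"

def mask_email(email: str) -> str:
    local, _, domain = email.partition("@")
    # keep only the first character of the local part; X out the rest,
    # preserving the dots
    masked = local[0] + "".join("." if c == "." else "x" for c in local[1:])
    return f"{masked}@{domain}"

assert mask_ssn("123-456-7891") == "XXX-XX-7891"
assert mask_email("abc.def@xyz.com") == "axx.xxx@xyz.com"
```

Comparing such reference output row by row with the lake's masked column catches both wrong masking and the slide's "unexpected input format" failure mode.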
8
Structured Data (4/9)
Data Reconciliation at Each Load Frequency
▪ Data reconciliation is a testing gate wherein the data loaded into the target is compared against the data in the source to ensure no data is dropped or corrupted in the migration process
▪ Record count for each table between source and data lake:
• Initial load
• Incremental load
▪ Truncate load vs. append load:
• Truncate load: data in the target table is truncated every time a new data feed is about to be loaded
• Append load: a new data feed is appended to the data already present in the target table
▪ Data validation for each table between source and data lake: data in each row and column of the source table is compared with the data lake
• With MS Excel, batching is required to compare a manageable subset of data
• To compare the entire dataset, a custom-built automation tool can be used
▪ Duplicates in the data lake: SQL queries can be used to identify whether any duplicates were introduced during the data lake load
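The count reconciliation step above reduces to running `SELECT COUNT(*)` on both sides and diffing the results. A minimal sketch, with sqlite3 standing in for the real SQL Server and Hive/Impala connections (table name and a deliberately dropped row are illustrative):

```python
import sqlite3

# Minimal reconciliation sketch: compare per-table row counts between a
# "source" and a "lake" connection. sqlite3 is a stand-in for the real
# source and data lake engines; the table name is illustrative.
def row_counts(conn, tables):
    return {t: conn.execute(f"SELECT COUNT(*) FROM {t}").fetchone()[0]
            for t in tables}

def reconcile(source, lake, tables):
    """Return the tables whose counts disagree, with both counts."""
    src, tgt = row_counts(source, tables), row_counts(lake, tables)
    return {t: (src[t], tgt[t]) for t in tables if src[t] != tgt[t]}

# demo: the lake copy silently dropped one row
source, lake = sqlite3.connect(":memory:"), sqlite3.connect(":memory:")
for conn in (source, lake):
    conn.execute("CREATE TABLE patient (id INTEGER)")
source.executemany("INSERT INTO patient VALUES (?)", [(1,), (2,), (3,)])
lake.executemany("INSERT INTO patient VALUES (?)", [(1,), (2,)])
print(reconcile(source, lake, ["patient"]))   # {'patient': (3, 2)}
```

The same loop runs after every initial and incremental load; an empty result means the counts reconcile.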
9
Structured Data (5/9)
ELT Framework Validation
▪ Logging the source type indicates whether the source data comes from a table or a flat file
▪ Logging the source connection string (DB link, file path, etc.):
• Indicates the database connection string when source data comes from a database table
• When the source is a flat file, indicates the landing location of the file and where to read the data from
▪ Generation of batch IDs on a fresh run, and on rerun after a failure, helps identify the data loaded in each day's batch
▪ Flags such as the primary key flag, truncate load flag, critical table flag and upload-to-cloud flag define the behavior of the ELT jobs
▪ Logging of records processed, loaded and rejected for each table shows the number of records extracted from source and rejected or loaded into the target data lake
▪ Polling frequency, trigger check and email notification settings indicate how often to poll for incoming files/data, when to trigger the next batch, and when to send notifications of batch status
10
Structured Data (6/9)
On-Premise vs. On-Cloud Validation
▪ Additional testing is required when sources are hosted on-premise and the data lake is created on cloud
▪ Validate the data types supported by the cloud application (through which data analysts/scientists will query the data)
• For e.g., Azure SQL Data Warehouse does not support the timestamp and text data types; source data must be cast to datetime2 and varchar respectively
• For e.g., Impala does not support the DOUBLE data type, which has to be converted to NUMERIC
▪ Validate user group access (who can see what type of data)
▪ Validate masked/unmasked views based on the type of user
▪ Validate attribute names containing spaces, special characters or reserved keywords between on-premise and on-cloud
• For e.g., source attributes named LOCATION and DIV are reserved keywords in Impala, so they must be renamed (e.g., to LOC and DIVISION) to preserve their meaning
▪ Validate external tables created on the HDFS files published through ELT jobs
• For e.g., validate whether the external table points to the correct HDFS location where the ETL jobs publish the files
▪ Validate data consistency between on-premise and on-cloud
• For e.g., use custom-built validation tools to compare each attribute
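One way to make the data type validation above repeatable is to encode the engine's unsupported types as a mapping and scan the source schema for columns that need a cast. This is only a sketch: the mapping simply restates the slide's examples and the schema dictionary is illustrative.

```python
# Illustrative type-mapping check: source types the cloud engine does not
# support must be cast before load. The mapping restates the examples on
# this slide and is an assumption, not an authoritative support matrix.
CAST_RULES = {
    "timestamp": "datetime2",   # Azure SQL DW example from the slide
    "text": "varchar",          # Azure SQL DW example from the slide
    "double": "numeric",        # Impala example from the slide
}

def required_casts(source_schema: dict) -> dict:
    """Map column -> (source_type, target_type) for columns needing a cast."""
    return {col: (t, CAST_RULES[t.lower()])
            for col, t in source_schema.items() if t.lower() in CAST_RULES}

schema = {"admit_ts": "timestamp", "notes": "text", "mrn": "varchar"}
print(required_casts(schema))
# {'admit_ts': ('timestamp', 'datetime2'), 'notes': ('text', 'varchar')}
```

QA can diff this expected cast list against the DDL actually deployed on cloud.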
11
Structured Data (7/9)
Data Quality and Standardization Validation
▪ Data in the data lake needs to be cleansed and standardized for better analysis
▪ Actual data is not removed or updated; instead, flags (Warning/Invalid) highlight the data quality
▪ Validate the various DQ/DS rules by running SQL queries on source data and comparing the output with that of the DQ/DS tools
▪ For e.g., Null Check DQ rule on MRN (Medical Record Number): when data is processed through a DQ tool, all records with a NULL MRN are flagged as WARNING/INVALID, with a remark column stating MRN is NULL
▪ For e.g., Missing Parent DQ rule on Encounter ID with respect to Patient: when an encounter has no associated patient in the patient table, the Encounter ID is flagged as WARNING/INVALID, with a remark column stating PATIENT MISSING
▪ For e.g., Race data standardization:
• Race codes in the source such as 1001 and 1002 need to be standardized to their corresponding descriptions, such as Hispanic, Asian, etc.
• Based on requirements, standardization can be applied to reference tables or to transaction (data) tables as well
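The two DQ rules above follow the same flag-don't-delete pattern, which can be sketched as follows (column names, flag values and remarks mirror the slide's examples; the helper names are mine):

```python
# Sketch of the two DQ rules described above: rows are flagged with a
# remark rather than removed. Column/table names are illustrative.
def null_check(rows, key="mrn"):
    """Flag rows whose MRN is missing as WARNING, with a remark."""
    return [{**r, "dq_flag": "WARNING", "dq_remark": "MRN is NULL"}
            if r.get(key) is None
            else {**r, "dq_flag": "VALID", "dq_remark": ""}
            for r in rows]

def missing_parent_check(encounters, patient_ids):
    """Flag encounters whose patient is absent from the patient table."""
    return [{**e, "dq_flag": "INVALID", "dq_remark": "PATIENT MISSING"}
            if e["patient_id"] not in patient_ids
            else {**e, "dq_flag": "VALID", "dq_remark": ""}
            for e in encounters]

rows = [{"mrn": "M1"}, {"mrn": None}]
print([r["dq_flag"] for r in null_check(rows)])        # ['VALID', 'WARNING']

encs = [{"encounter_id": "E1", "patient_id": "P1"},
        {"encounter_id": "E2", "patient_id": "P9"}]
print([e["dq_flag"] for e in missing_parent_check(encs, {"P1"})])
# ['VALID', 'INVALID']
```

The QA equivalent is to run the same predicates as SQL against the source and diff the flags with what the DQ tool wrote.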
12
Structured Data (8/9)
Data Partitioning and Compaction
▪ When the data lake is created on cloud using the Hadoop file system (HDFS), it is preferable to store data in partitions (based on user-provided partition criteria) and use compaction to minimize read operations against the underlying HDFS
▪ Validate whether the source data is partitioned appropriately while being stored in the data lake on HDFS. For e.g., partitioning on Encounter_Create_Date:
• This creates a year-wise folder structure in the output, with each file holding the encounters for that year
• Data retrieval is faster when analysts/scientists query a specific date range, since the data is already stored in such partitions
▪ Compaction covers two areas:
• Merging multiple smaller file chunks into a predefined file size to avoid multiple read operations
• Converting text format into Parquet/ZIP format to reduce file size
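The year-wise folder layout described above can be sketched in plain Python: group rows by the year of Encounter_Create_Date, producing one "partition folder" per year, just as a partitioned HDFS write would. The partition naming and field names here are assumptions for illustration.

```python
from collections import defaultdict
from datetime import date

# Illustrative sketch of date-based partitioning: group encounters by the
# year of Encounter_Create_Date, mirroring the year-wise folder layout a
# partitioned HDFS write would produce. Folder names are assumed.
def partition_by_year(encounters):
    parts = defaultdict(list)
    for e in encounters:
        parts[f"encounter_create_year={e['create_date'].year}"].append(e)
    return dict(parts)

encounters = [
    {"id": "E1", "create_date": date(2018, 3, 1)},
    {"id": "E2", "create_date": date(2019, 7, 9)},
    {"id": "E3", "create_date": date(2019, 1, 2)},
]
layout = partition_by_year(encounters)
print(sorted(layout))                              # one "folder" per year
print(len(layout["encounter_create_year=2019"]))   # 2
```

QA validation then amounts to checking that each partition folder on HDFS contains only rows whose date falls in that partition, and that no rows are missing across partitions.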
13
Structured Data (9/9)
Data Partitioning and Compaction
▪ Validate the size of the files published on HDFS by the ELT jobs by logging into Impala
• For e.g., ELT jobs produced .txt files on cloud with a size of 3.19 GB, which reduced to 516 MB after Parquet conversion
▪ Validate the merging of multiple smaller files into one or more large files
• For e.g., a DQ-DS tool might produce many small files (based on storage availability on the underlying data nodes) at the output location; a utility can be written to merge all of them into a single Parquet-format file
14
Structured Data – Challenges
Test Strategy
▪ ELT jobs are unable to manage continuously varying data containing special characters
▪ Delimiters arriving as part of incoming data
▪ PHI data in a format different from the test data may cause the masking logic to fail
▪ Backdated inserts are not captured in incremental runs
Tools and Technology
▪ Limitations on the data types, reserved keywords and special characters handled by cloud applications
▪ Date format conversion based on the time zone selected during installation
▪ Data retrieval challenges, and the cost involved in downloading data for analysis
Configuration / Environment
▪ Late-arriving flat files
▪ Any service breakdown in production causes parts of the end-to-end workflow to break
Others
▪ Project timelines need to accommodate unknowns discovered in the data during production deployment and/or a few weeks after deployment
15
Semi-structured Data (1/6)
Flow Diagram of Semi-Structured (JSON) Message Ingestion into the Data Lake
16
Semi-structured Data (2/6)
Testing Gates
▪ Data lake creation and testing for semi-structured data can take place in the following areas:
• JSON message validation
• Data reconciliation
• ELT framework (Extract, Load and Transform)
• Data quality and standardization validations
17
Semi-structured Data (3/6)
JSON Message Validation
▪ The data lake can be integrated with the Kafka messaging system, which produces JSON messages in semi-structured format
▪ As part of JSON message validation, the QA team can cover the following:
• Comparing the JSON message with the JSON schema provided as part of the requirements
• Data type check
• Null / Not Null constraint check
▪ For instance:

JSON Schema:
{ "ServiceLevel": { "type": ["string", "null"] },
  "ServiceType": { "type": ["string"] } }

JSON Message:
{ "ServiceLevel": "One",
  "ServiceType": "Skilled Nurse" }
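The schema-vs-message comparison above can be automated with a small validator for the slide's simplified per-field schema (each field lists its allowed JSON types, with "null" permitting a missing value). This is a sketch for exactly that simplified form, not a full JSON Schema implementation:

```python
# Minimal validator for the simplified per-field schema shown on this
# slide: each field declares its allowed JSON types; "null" permits a
# missing/None value. Not a full JSON Schema implementation.
TYPE_MAP = {"string": str, "number": (int, float), "boolean": bool}

def validate(message: dict, schema: dict):
    errors = []
    for field, spec in schema.items():
        allowed = spec["type"] if isinstance(spec["type"], list) else [spec["type"]]
        value = message.get(field)
        if value is None:
            if "null" not in allowed:
                errors.append(f"{field}: null not allowed")
        elif not isinstance(value, tuple(TYPE_MAP[t] for t in allowed if t != "null")):
            errors.append(f"{field}: expected {allowed}")
    return errors

schema = {"ServiceLevel": {"type": ["string", "null"]},
          "ServiceType": {"type": ["string"]}}
# the slide's example message passes; a null ServiceType fails
assert validate({"ServiceLevel": "One", "ServiceType": "Skilled Nurse"}, schema) == []
assert validate({"ServiceLevel": None, "ServiceType": None}, schema) == \
    ["ServiceType: null not allowed"]
```

In practice the same checks are usually delegated to a schema validation library, but the logic per message is as above.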
18
Semi-structured Data (4/6)
Data Reconciliation
▪ Data reconciliation is a testing gate wherein the data loaded into the target tables is compared against the data in the source JSON messages, to ensure no data is trimmed, corrupted or missed in the migration process. The following strategies can be used:
▪ Record count for each table between source and data lake:
• Simple JSON: a single JSON message ingested into the lake is loaded as a single row in the target table
• Complex JSON: more than one row is loaded into the target table, depending on the level of hierarchy and nesting present in the JSON message
▪ Data validation for each table between source and data lake: data in each row and column of the JSON message is compared with the data lake
• Use the openjson function in SQL Server to parse the JSON messages and convert them into structured format
• Compare the parsed output of openjson with the data loaded in the target tables using Python
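The Python-side comparison can be sketched as follows: flatten a nested (complex) JSON message the way it fans out into multiple target rows (one row per nested element, parent keys carried down), then diff the result against what was actually loaded. Field and table names here are illustrative, not from the deck.

```python
# Sketch: flatten a complex JSON message into the multiple rows it should
# produce in the target table, then diff against the loaded rows.
# Field names (patient_id, encounters, ...) are illustrative.
def flatten_encounters(message: dict):
    """One output row per nested encounter, with parent keys carried down."""
    return [{"patient_id": message["patient_id"],
             "encounter_id": enc["id"],
             "type": enc["type"]}
            for enc in message["encounters"]]

msg = {"patient_id": "P1",
       "encounters": [{"id": "E1", "type": "inpatient"},
                      {"id": "E2", "type": "er"}]}
expected = flatten_encounters(msg)

# pretend the lake loaded only one of the two expected rows
loaded = [{"patient_id": "P1", "encounter_id": "E1", "type": "inpatient"}]
missing = [row for row in expected if row not in loaded]
print(len(expected), len(missing))   # 2 1
```

A non-empty `missing` list is exactly the "row dropped during migration" condition this testing gate exists to catch; openjson plays the `flatten_encounters` role on the SQL Server side.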
19
Semi-structured Data (5/6)
ELT Framework Validation
▪ Raw layer validation on HDFS: indicates whether the source JSON messages ingested through Kafka are loaded into HDFS in raw form before the processing tool picks them up and loads them into the target tables
▪ Logging of Kafka details: informs about the Kafka topic, partition, offset and hostname
▪ Generation of data lake UIDs helps identify JSON messages
▪ Logging of records processed, loaded and rejected for each JSON ingestion shows the number of records ingested through Kafka, processed, and failed in JSON schema validation, with error logging
▪ Email notifications: show the JSONs ingested through Kafka on an hourly basis, the JSON count loaded into the raw layer, and a daily report with JSON-wise counts, failures and successful ingestions
20
Semi-structured Data (6/6)
Data Quality and Standardization Validation
▪ Data in the data lake needs to be cleansed and standardized for better analysis
▪ In this case, the actual data need not be removed or updated; both valid and erroneous data are logged into the audit tables
▪ Validate JSON messages against the JSON schema
• For e.g., Null Check: if a non-nullable attribute is assigned a null value in the input JSON message, ingestion of that message fails and the error is logged in the audit table with the non-nullable attribute's details
21
Semi-structured Data – Challenges
Test Strategy
▪ ELT jobs are unable to manage continuously varying data containing special characters
▪ Test data preparation as per the test scenarios
▪ Manual validation of a single JSON loaded into multiple tables
▪ Live reconciliation of messages produced through Kafka, due to continuous streaming
Tools and Technology
▪ Limitations on reserved keywords and special character handling
Configuration / Environment
▪ Multiple services running simultaneously on the cluster result in choking of JSON messages in the QA environment
▪ Any service breakdown in production causes parts of the end-to-end workflow to break
Others
▪ Project timelines need to accommodate unknowns discovered in the data during production deployment and/or a few weeks after deployment
22
References
▪ https://www.cloudmoyo.com/blog/difference-between-a-data-warehouse-and-a-data-lake/
▪ https://utf8-chartable.de/unicode-utf8-table.pl
▪ https://www.cigniti.com/blog/5-big-data-testing-challenges/
▪ https://www.confluent.io/blog/the-value-of-apache-kafka-in-big-data-ecosystem/
▪ https://kafka.apache.org/
▪ https://dzone.com/articles/json-drivers-parsing-hierarchical-data
23
About CitiusTech
▪ 3,500+ healthcare IT professionals worldwide
▪ 1,500+ in healthcare software engineering
▪ 800+ HL7 certified professionals
▪ 30%+ CAGR over the last 5 years
▪ 110+ healthcare customers
• Healthcare technology companies
• Hospitals, IDNs & medical groups
• Payers and health plans
• ACO, MCO, HIE, HIX, NHIN and RHIO
• Pharma & Life Sciences companies
Thank You
Authors: Vaibhav Shahane, Vaibhavi Indap
Technical Lead
thoughtleaders@citiustech.com