
Testing Strategies for Data Lake Hosted on Hadoop


  1. This document is confidential and contains proprietary information, including trade secrets of CitiusTech. Neither the document nor any of the information contained in it may be reproduced or disclosed to any unauthorized person under any circumstances without the express written permission of CitiusTech.
     Testing Strategies for Data Lake Hosted on Hadoop
     August 2019 | Authors: Vaibhav Shahane and Vaibhavi Indap
     CitiusTech Thought Leadership
  2. Agenda
     ▪ Overview
     ▪ Data Lake vs. Data Warehouse
     ▪ Structured Data
     ▪ Semi-structured Data
     ▪ References
  3. Overview
     ▪ A data lake is a repository that stores massive amounts of data in its native form
     ▪ Businesses have disparate sources of data that are difficult to analyze unless brought together on a single platform (a common pool of data)
     ▪ A data lake allows business decision makers, data analysts, and data scientists to get a holistic view of data coming in from heterogeneous sources
  4. Data Lake vs. Data Warehouse
     Similarities
     ▪ Data lakes maintain heterogeneous sources in a single pool
     ▪ They provide better access to enterprise-wide data for analysts and scientists
     Differences
     ▪ Data is highly organized and structured in data warehouses
     ▪ A data lake uses a flat structure and the original schema
     ▪ Data present in data warehouses is transformed, aggregated, and may lose its original schema
     ▪ Data warehouses provide transactional solutions by enabling analysts to drill down/up/through specific areas of business
     ▪ Data lakes answer questions that aren't structured but need discovery using iterative algorithms and/or complex mathematical functions
  5. Structured Data (1/9) – Testing Gates
     ▪ Data lake creation and testing can be organized around the following areas:
       • Schema validations
       • Data masking validations
       • Data reconciliation at each load frequency
       • ELT framework (Extract, Load and Transform)
       • On-premise vs. on-cloud validations (in case the data lake is hosted on the cloud)
       • Data quality and standardization validations
       • Data partitioning and compaction
  6. Structured Data (2/9) – Schema Validation
     ▪ Data from heterogeneous sources is present in the data lake, and the schema of every table is preserved in the source
     ▪ As part of schema validation, the QA team can cover the following pointers:
       • Data type
       • Data length
       • Null/Not Null constraints
       • Delimiters (pay attention to delimiters coming in as part of the data)
       • Special characters (visible and invisible). For example, hex code 'c2 ad' is a soft hyphen; it appears as a white space that does not go away when the TRIM function is applied during data comparison
     ▪ Source metadata and metadata from the data lake can be extracted from the respective metastores and compared (see the sketch below)
       • If the source is SQL Server, metadata can be retrieved using the sp_help stored procedure
       • If the data lake is on HDFS, Hive has its own metastore tables such as DBS, TBLS, COLUMNS_V2, SDS, etc.
     ▪ Visit https://utf8-chartable.de/unicode-utf8-table.pl for more characters
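     A minimal Python sketch of the metadata comparison described above; the connection details, the table name, and the use of INFORMATION_SCHEMA on the source side are assumptions for illustration, not the deck's exact method:

       import pandas as pd
       import pyodbc            # SQL Server source (assumed DSN below)
       from pyhive import hive  # Hive on the data lake side (assumed host)

       # Source-side column metadata from SQL Server for a hypothetical 'patient' table
       src_conn = pyodbc.connect("DSN=source_dw")
       src_cols = pd.read_sql(
           """SELECT LOWER(COLUMN_NAME) AS col_name,
                     LOWER(DATA_TYPE) AS src_type,
                     CHARACTER_MAXIMUM_LENGTH AS src_len,
                     IS_NULLABLE AS src_nullable
              FROM INFORMATION_SCHEMA.COLUMNS
              WHERE TABLE_NAME = 'patient'""",
           src_conn)

       # Lake-side column metadata from Hive (DESCRIBE returns col_name, data_type, comment)
       lake_cur = hive.connect(host="lake-edge-node").cursor()
       lake_cur.execute("DESCRIBE patient")
       lake_cols = pd.DataFrame(lake_cur.fetchall(),
                                columns=["col_name", "lake_type", "comment"])

       # Outer join on column name: anything missing on either side is a defect.
       # In practice a source-to-Hive type mapping (e.g. varchar -> string) is
       # applied before comparing src_type with lake_type.
       merged = src_cols.merge(lake_cols, on="col_name", how="outer", indicator=True)
       print(merged[merged["_merge"] != "both"])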
  7. Structured Data (3/9) – Data Masking Validations
     ▪ Source systems might have PHI/PII data in unmasked form unless the client has anonymized it beforehand
     ▪ The data masking implementation can be tested against the pre-agreed masking logic
     ▪ The masking logic can be written as a SQL query / Excel formula and its output compared with the data masked by the ETL code that has flowed into the data lake (see the sketch below)
     ▪ E.g., unmasked SSN 123-45-7891 needs to be masked as XXX-XX-7891
     ▪ E.g., unmasked email abc.def@xyz.com needs to be masked as axx.xxx@xyz.com
     ▪ Pay attention to unmasked data that does not arrive in the expected format, which can cause the masking logic to fail
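     The deck suggests writing the masking logic as a SQL query or Excel formula; the same two example rules are sketched here in Python so the expected values can be compared row by row with what the ETL code loaded into the lake:

       import re

       def mask_ssn(ssn: str) -> str:
           """Mask all but the last four digits, e.g. 123-45-7891 -> XXX-XX-7891."""
           m = re.fullmatch(r"(\d{3})-(\d{2})-(\d{4})", ssn)
           if not m:
               # Unexpected format: surface it instead of masking silently,
               # since malformed input is a known cause of masking failures.
               raise ValueError(f"Unexpected SSN format: {ssn!r}")
           return f"XXX-XX-{m.group(3)}"

       def mask_email(email: str) -> str:
           """Keep the first character of the local part, e.g. abc.def@xyz.com -> axx.xxx@xyz.com."""
           local, _, domain = email.partition("@")
           masked = local[0] + re.sub(r"[^.]", "x", local[1:])
           return f"{masked}@{domain}"

       # Expected masked values to compare against the data lake contents
       assert mask_ssn("123-45-7891") == "XXX-XX-7891"
       assert mask_email("abc.def@xyz.com") == "axx.xxx@xyz.com"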
  8. Structured Data (4/9) – Data Reconciliation at Each Load Frequency
     ▪ Data reconciliation is a testing gate wherein the data loaded in the target is compared against the data in the source to ensure that no data is dropped or corrupted in the migration process
     ▪ Record count for each table between source and data lake:
       • Initial load
       • Incremental load
     ▪ Truncate load vs. append load:
       • Truncate load: data in the target table is truncated every time a new data feed is about to be loaded
       • Append load: a new data feed is appended to the data already existing in the target table
     ▪ Data validation for each table between source and data lake: data in each row and column of the source table is compared with the data lake
       • To use MS Excel, the data has to be batched so that only a handful of records are compared at a time
       • To compare the entire dataset, a custom-built automation tool can be used
     ▪ Duplicates in the data lake: SQL queries can be used to identify whether any duplicates were introduced during the data lake load (see the sketch below)
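     A sketch of the record count reconciliation and duplicate check in Python; the connections, table name, and key column are assumed placeholders:

       import pandas as pd
       import pyodbc
       from pyhive import hive

       TABLE, KEY = "patient", "patient_id"            # hypothetical table and key names

       src_conn = pyodbc.connect("DSN=source_dw")      # assumed source DSN
       lake_conn = hive.connect(host="lake-edge-node") # assumed Hive host

       def scalar(conn, sql):
           cur = conn.cursor()
           cur.execute(sql)
           return cur.fetchone()[0]

       # Record count reconciliation (repeat per table, for initial and incremental loads)
       src_count = scalar(src_conn, f"SELECT COUNT(*) FROM {TABLE}")
       lake_count = scalar(lake_conn, f"SELECT COUNT(*) FROM {TABLE}")
       assert src_count == lake_count, f"Count mismatch: source={src_count}, lake={lake_count}"

       # Duplicate check: any key appearing more than once was introduced by the load
       dupes = pd.read_sql(
           f"SELECT {KEY}, COUNT(*) AS cnt FROM {TABLE} GROUP BY {KEY} HAVING COUNT(*) > 1",
           lake_conn)
       assert dupes.empty, f"{len(dupes)} duplicate keys found in the data lake"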
  9. Structured Data (5/9) – ELT Framework Validation
     ▪ Logging of the source type indicates whether the source data comes from a database table or a flat file
     ▪ Logging of the source connection string (DB link, file path, etc.):
       • Indicates the database connection string if the source data is coming from a database table
       • If the source is a flat file, it indicates the landing location of the file and where to read the data from
     ▪ Generation of batch IDs on a fresh run and on a rerun after failure helps identify the data loaded in each daily batch
     ▪ Flags such as the primary key flag, truncate load flag, critical table flag, upload-to-cloud flag, etc. help define the behavior of the ELT jobs
     ▪ Logging of records processed, loaded and rejected for each table shows the number of records extracted from the source and rejected/loaded into the target data lake (see the sketch below)
     ▪ Polling frequency, trigger check, email notification, etc. indicate how often to poll for incoming files/data, when to trigger the next batch, and when to send notifications of the batch status
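     One way to test the processed/loaded/rejected logging is to check that the audit counts reconcile per table; the audit table layout, batch ID, and connection below are hypothetical:

       import pandas as pd
       import pyodbc

       audit_conn = pyodbc.connect("DSN=elt_audit")   # assumed audit database DSN

       audit = pd.read_sql(
           """SELECT table_name, records_processed, records_loaded, records_rejected
              FROM elt_batch_audit
              WHERE batch_id = 20190801""",           # hypothetical batch ID
           audit_conn)

       # For each table, processed records should equal loaded plus rejected
       bad = audit[audit["records_processed"] !=
                   audit["records_loaded"] + audit["records_rejected"]]
       print(bad if not bad.empty else "All tables reconcile for this batch")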
  10. Structured Data (6/9) – On-Premise vs. On-Cloud Validation
     ▪ Additional testing is required when the sources are hosted on-premise and the data lake is being created on the cloud
     ▪ Validate the data types supported by the cloud application (through which data analysts/scientists will be querying the data)
       • For example, Azure SQL Data Warehouse doesn't support the timestamp/text data types; source data needs to be cast to datetime2/varchar respectively
       • For example, where a source data type such as DOUBLE is not supported by the cloud engine (e.g. Impala), it has to be converted to NUMERIC
     ▪ Validate user group access (who can see what type of data)
     ▪ Validate masked/unmasked views based on the type of user
     ▪ Validate attribute names with spaces/special characters/reserved keywords between on-premise and on-cloud (see the sketch below)
       • For example, source attributes named LOCATION and DIV are reserved keywords in Impala; they must be renamed, e.g. to LOC and DIVISION, to preserve the meaning
     ▪ Validate external tables created on HDFS files published through ELT jobs
       • For example, validate whether the external table is pointing to the correct location on HDFS where the files are being published by the ELT jobs
     ▪ Validate data consistency between on-premise and on-cloud
       • For example, use custom-built validation tools to compare each attribute
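     A small sketch of the attribute-name check; the keyword list is an illustrative subset only and the column list is hypothetical (in practice both would come from the target engine's documentation and the lake metastore):

       import re

       IMPALA_RESERVED = {"location", "div", "date", "comment"}  # illustrative subset only

       lake_columns = ["LOC", "DIVISION", "Encounter_Create_Date", "patient name"]  # hypothetical

       for col in lake_columns:
           if col.lower() in IMPALA_RESERVED:
               print(f"Reserved keyword used as attribute name: {col}")
           if re.search(r"[^A-Za-z0-9_]", col):
               print(f"Attribute name contains spaces/special characters: {col}")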
  11. Structured Data (7/9) – Data Quality and Standardization Validation
     ▪ Data in the data lake needs to be cleansed and standardized for better analysis
     ▪ The actual data isn't removed or updated; instead, flags (Warning / Invalid) can highlight the data quality
     ▪ Validate the various DQ/DS rules with SQL queries on the source data and compare the output with the DQ/DS tool's output (see the sketch below)
     ▪ For example, a Null Check DQ rule on MRN (Medical Record Number): when data is processed through the DQ tool, all records with a NULL MRN are flagged as WARNING/INVALID along with a remark column that states MRN is NULL
     ▪ For example, a Missing Parent DQ rule on Encounter ID with respect to Patient: when an encounter doesn't have an associated patient in the patient table, the Encounter ID is flagged as WARNING/INVALID along with a remark column that states PATIENT MISSING
     ▪ For example, Race data standardization:
       • Race data in the source with codes such as 1001, 1002 needs to be standardized with the corresponding descriptions such as Hispanic, Asian, etc.
       • Based on requirements, standardization can be applied to reference tables or to transaction (data) tables as well
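     A sketch that reproduces the Null Check and Missing Parent rules as SQL and compares the expected counts with what the DQ tool flagged; the table names, rule names, and audit-table layout are assumptions:

       import pandas as pd
       import pyodbc

       conn = pyodbc.connect("DSN=source_dw")   # assumed DSN

       expected_null_mrn = pd.read_sql(
           "SELECT COUNT(*) AS cnt FROM patient WHERE mrn IS NULL", conn)["cnt"].iloc[0]

       expected_missing_parent = pd.read_sql(
           """SELECT COUNT(*) AS cnt
              FROM encounter e
              LEFT JOIN patient p ON e.patient_id = p.patient_id
              WHERE p.patient_id IS NULL""", conn)["cnt"].iloc[0]

       # Counts the DQ tool wrote to its audit/flag table (layout assumed)
       flagged = pd.read_sql(
           """SELECT rule_name, COUNT(*) AS cnt
              FROM dq_audit
              WHERE flag IN ('WARNING', 'INVALID')
              GROUP BY rule_name""", conn).set_index("rule_name")["cnt"]

       assert flagged.get("NULL_MRN", 0) == expected_null_mrn
       assert flagged.get("MISSING_PATIENT", 0) == expected_missing_parent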
  12. Structured Data (8/9) – Data Partitioning and Compaction
     ▪ When the data lake is created on the cloud using the Hadoop file system (HDFS), it is preferable to store data in partitions (based on user-provided partition criteria) and to use compaction in order to minimize repeated read operations against the underlying HDFS
     ▪ Validate whether the source data is partitioned appropriately while being stored in the data lake on HDFS. For example, partitioning based on Encounter_Create_Date (see the sketch below):
       • This creates a folder structure in the output by year, with each folder containing the encounter files for that year
       • Data retrieval is faster when analysts/scientists query a specific date range, since the data is already stored in such partitions
     ▪ Compaction covers two areas:
       • Merging multiple smaller file chunks into a predefined file size to avoid multiple read operations
       • Converting text format into Parquet/ZIP format to reduce file size
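     PySpark is one way to produce such partitioned, compacted output; the paths below are assumed, and Encounter_Create_Date is assumed to be parseable as a date:

       from pyspark.sql import SparkSession, functions as F

       spark = SparkSession.builder.appName("encounter_partitioning").getOrCreate()

       # Assumed landing path; derive the partition key from Encounter_Create_Date
       encounters = spark.read.option("header", True).csv("hdfs:///landing/encounters/")
       encounters = encounters.withColumn("encounter_year",
                                          F.year(F.to_date("Encounter_Create_Date")))

       (encounters
           .repartition("encounter_year")            # fewer, larger files per partition (compaction)
           .write
           .mode("overwrite")
           .partitionBy("encounter_year")            # one HDFS folder per year
           .parquet("hdfs:///datalake/encounters/")) # Parquet conversion also shrinks the files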
  13. Structured Data (9/9) – Data Partitioning and Compaction
     ▪ Validate the size of the files published on HDFS by the ELT jobs, e.g. by logging into Impala
       • For example, ELT jobs produced files in .txt format on the cloud with a total size of 3.19 GB, which was reduced to 516 MB after Parquet conversion
     ▪ Validate the merging of multiple smaller files into one or more large files (see the sketch below)
       • For example, the DQ-DS tool might produce multiple small files (based on storage availability on the underlying data nodes), which can be seen at the output location; a utility can be written to merge all these files into a single file in Parquet format
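     A sketch of such a merge utility in PySpark (paths assumed): read the many small text files the DQ-DS tool produced and rewrite them as a single Parquet file.

       from pyspark.sql import SparkSession

       spark = SparkSession.builder.appName("small_file_compaction").getOrCreate()

       small_files = spark.read.option("header", True).csv("hdfs:///dqds/output/encounters/")

       (small_files
           .coalesce(1)                # one output file; use a higher number for very large datasets
           .write
           .mode("overwrite")
           .parquet("hdfs:///datalake/encounters_compacted/"))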
  14. Structured Data – Challenges
     Test Strategy
     ▪ ELT jobs are unable to manage continuously varying data with special characters
     ▪ Delimiters arriving as part of the incoming data
     ▪ PHI data in a format different from the test data may cause the masking logic to fail
     ▪ Backdated inserts not captured in incremental runs
     Tools and Technology
     ▪ Limitations on data types, reserved keywords and special characters handled by cloud applications
     ▪ Date format conversion based on the time zone selected during installation
     ▪ Data retrieval challenges and the cost involved in downloading data for analysis
     Configuration / Environment
     ▪ Late-arriving flat files
     ▪ Any service breakdown in production causes parts of the end-to-end workflow to break
     Others
     ▪ Project timelines need to accommodate any unknowns in the data during production deployment and/or a few weeks after deployment
  15. Semi-structured Data (1/6) – Flow Diagram of Semi-Structured (JSON) Message Ingestion into the Data Lake
  16. Semi-structured Data (2/6) – Testing Gates
     ▪ Data lake creation and testing for semi-structured data can be organized around the following areas:
       • JSON message validation
       • Data reconciliation
       • ELT framework (Extract, Load and Transform)
       • Data quality and standardization validations
  17. Semi-structured Data (3/6) – JSON Message Validation
     ▪ The data lake can be integrated with a Kafka messaging system that produces JSON messages in semi-structured format
     ▪ As part of JSON message validation, the QA team can cover the following pointers (see the sketch below):
       • Compare the JSON message with the JSON schema provided as part of the requirements
       • Data type check
       • Null / Not Null constraint check
     ▪ For instance:
       JSON Schema:
       {
         "ServiceLevel": { "type": ["string", "null"] },
         "ServiceType": { "type": ["string"] }
       }
       JSON Message:
       {
         "ServiceLevel": "One",
         "ServiceType": "Skilled Nurse"
       }
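     One way to automate this comparison is the Python jsonschema package (an assumption; the deck does not name a tool). The deck's schema fragment is wrapped into a complete schema object here, and the "required" list is an assumption reflecting the Not Null constraint check:

       from jsonschema import ValidationError, validate

       SERVICE_SCHEMA = {
           "type": "object",
           "properties": {
               "ServiceLevel": {"type": ["string", "null"]},
               "ServiceType": {"type": ["string"]},
           },
           "required": ["ServiceType"],
       }

       message = {"ServiceLevel": "One", "ServiceType": "Skilled Nurse"}

       try:
           validate(instance=message, schema=SERVICE_SCHEMA)
           print("Message conforms to the schema")
       except ValidationError as err:
           # Failed messages would be logged to the audit table with the offending attribute
           print(f"Schema violation: {err.message}")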
  18. Semi-structured Data (4/6) – Data Reconciliation
     ▪ Data reconciliation is a testing gate wherein the data loaded in the target tables is compared against the data in the source JSON messages to ensure that no data is trimmed/corrupted/missed in the migration process. The following strategies can be used:
     ▪ Record count for each table between source and data lake:
       • Simple JSON: a single JSON message ingested into the lake is loaded as a single row in the target table
       • Complex JSON: more than one row is loaded in the target table, depending on the level of hierarchy and nesting present in the JSON message
     ▪ Data validation for each table between source and data lake: data in each row and column of the JSON message is compared with the data lake (see the sketch below)
       • Use the OPENJSON function in SQL Server to parse the JSON messages and convert them into a structured format
       • Compare the parsed output of OPENJSON with the data loaded in the target tables using Python
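     A sketch of that comparison; it assumes the raw messages and the target table are reachable from one SQL Server connection, and the table/column names are placeholders:

       import pandas as pd
       import pyodbc

       conn = pyodbc.connect("DSN=lake_sql")   # assumed SQL Server DSN

       # Parse the raw JSON messages with OPENJSON (raw table/column names assumed)
       PARSED_SOURCE = """
       SELECT j.ServiceLevel, j.ServiceType
       FROM raw_json_messages m
       CROSS APPLY OPENJSON(m.message_body)
       WITH (ServiceLevel NVARCHAR(50) '$.ServiceLevel',
             ServiceType  NVARCHAR(50) '$.ServiceType') AS j
       """
       TARGET = "SELECT ServiceLevel, ServiceType FROM service_target"

       keys = ["ServiceLevel", "ServiceType"]
       source_df = pd.read_sql(PARSED_SOURCE, conn).sort_values(keys).reset_index(drop=True)
       target_df = pd.read_sql(TARGET, conn).sort_values(keys).reset_index(drop=True)

       # Raises an AssertionError with a readable diff if any row or column disagrees
       pd.testing.assert_frame_equal(source_df, target_df)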
  19. Semi-structured Data (5/6) – ELT Framework Validation
     ▪ Raw layer validation on HDFS: indicates whether the source JSON messages ingested through Kafka are loaded into HDFS in raw form before the processing tool picks up the messages and loads them into the target tables
     ▪ Logging of Kafka details: informs about the Kafka topic, partition, offset, and hostname (see the sketch below)
     ▪ Generation of data lake UIDs helps in identifying JSON messages
     ▪ Logging of records processed, loaded and rejected as part of each JSON ingestion shows the number of records ingested through Kafka, processed, and failed in JSON schema validation, with error logging
     ▪ Email notification: shows the JSONs ingested through Kafka on an hourly basis and the JSON count loaded in the raw layer, plus a daily report with per-JSON counts, failures and successful ingestions
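     A sketch using the kafka-python package to sample the topic/partition/offset details that the ELT framework is expected to log; the topic name and broker address are assumptions:

       from kafka import KafkaConsumer   # kafka-python package

       consumer = KafkaConsumer(
           "encounter-json",                      # hypothetical topic
           bootstrap_servers=["broker1:9092"],    # assumed broker
           auto_offset_reset="earliest",
           consumer_timeout_ms=10_000,            # stop iterating when no new messages arrive
       )

       for msg in consumer:
           # These values should match the Kafka details logged by the ELT jobs
           print(msg.topic, msg.partition, msg.offset)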
  20. Semi-structured Data (6/6) – Data Quality and Standardization Validation
     ▪ The data in the data lake needs to be cleansed and standardized for better analysis
     ▪ In this case, the actual data need not be removed or updated; both valid and erroneous data are logged into the audit tables
     ▪ Validate JSON messages against the JSON schema
       • For example, Null Check: if a non-nullable attribute is assigned a null value in the input JSON message, the ingestion of that message fails and the error gets logged in the audit table with the non-nullable attribute's details
  21. Semi-structured Data – Challenges
     Test Strategy
     ▪ ELT jobs are unable to manage continuously varying data with special characters
     ▪ Test data preparation as per the test scenarios
     ▪ Manual validation of a single JSON loaded into multiple tables
     ▪ Live reconciliation of messages produced through Kafka due to continuous streaming
     Tools and Technology
     ▪ Limitations on reserved keywords and special character handling
     Configuration / Environment
     ▪ Multiple services running simultaneously on the cluster result in choking of JSON messages in the QA environment
     ▪ Any service breakdown in production causes parts of the end-to-end workflow to break
     Others
     ▪ Project timelines need to accommodate any unknowns in the data during production deployment and/or a few weeks after deployment
  22. References
     ▪ https://www.cloudmoyo.com/blog/difference-between-a-data-warehouse-and-a-data-lake/
     ▪ https://utf8-chartable.de/unicode-utf8-table.pl
     ▪ https://www.cigniti.com/blog/5-big-data-testing-challenges/
     ▪ https://www.confluent.io/blog/the-value-of-apache-kafka-in-big-data-ecosystem/
     ▪ https://kafka.apache.org/
     ▪ https://dzone.com/articles/json-drivers-parsing-hierarchical-data
  23. About CitiusTech
     ▪ 3,500+ healthcare IT professionals worldwide
     ▪ 1,500+ in healthcare software engineering
     ▪ 800+ HL7 certified professionals
     ▪ 30%+ CAGR over the last 5 years
     ▪ 110+ healthcare customers
       • Healthcare technology companies
       • Hospitals, IDNs & medical groups
       • Payers and health plans
       • ACO, MCO, HIE, HIX, NHIN and RHIO
       • Pharma & Life Sciences companies
     Thank You
     Authors: Vaibhav Shahane, Vaibhavi Indap
     Technical Lead
     thoughtleaders@citiustech.com
