This document is confidential and contains proprietary information, including trade secrets of CitiusTech. Neither the document nor any of the information
contained in it may be reproduced or disclosed to any unauthorized person under any circumstances without the express written permission of CitiusTech.
Testing Strategies for Data Lake Hosted on Hadoop
August 2019 | Authors: Vaibhav Shahane and Vaibhavi Indap
CitiusTech Thought Leadership
Agenda
▪ Overview
▪ Data Lake vs. Data Warehouse
▪ Structured Data
▪ Semi-structured Data
▪ References
Overview
▪ A data lake is a repository that stores massive amounts of data in its native form
▪ Businesses have disparate sources of data that are difficult to analyze unless brought together on a single platform (a common pool of data)
▪ A data lake allows business decision makers, data analysts, and data scientists to get a holistic view of data coming in from heterogeneous sources
Data Lake vs. Data Warehouse
Similarities
▪ Data lakes maintain heterogeneous sources in a single pool
▪ They provide analysts and scientists better access to enterprise-wide data
Differences
▪ Data is highly organized and structured in data warehouses
▪ A data lake uses a flat structure and preserves the original schema
▪ Data in data warehouses is transformed, aggregated, and may lose its original schema
▪ Data warehouses provide transactional solutions by enabling analysts to drill down/up/through specific areas of the business
▪ Data lakes answer questions that aren't structured but need discovery using iterative algorithms and/or complex mathematical functions
Structured Data (1/9)
Testing Gates
▪ Data lake creation and testing can be organized around the following areas:
• Schema validations
• Data masking validations
• Data reconciliation at each load frequency
• ELT framework (Extract, Load and Transform)
• On-premise vs. on-cloud validations (in case the data lake is hosted on the cloud)
• Data quality and standardization validations
• Data partitioning and compaction
Structured Data (2/9)
Schema Validation
▪ Data from heterogeneous sources is present in the data lake, and the table schema of all tables is preserved from the source
▪ As part of schema validation, the QA team can cover the following pointers:
• Data type
• Data length
• Null/Not Null constraints
• Delimiters (pay attention to delimiters coming in as part of the data)
• Special characters (visible and invisible), e.g., hex code 'c2 ad' is a soft hyphen that appears as white space and does not go away when the TRIM function is applied during data comparison
▪ Source metadata and metadata from the data lake can be extracted from the respective metastores and compared (see the sketch below):
• If the source is on SQL Server (SSMS), metadata can be retrieved using the sp_help stored procedure
• If the data lake is on HDFS, Hive has its own metastore tables such as DBS, TBLS, COLUMNS_V2, SDS, etc.
▪ Visit https://utf8-chartable.de/unicode-utf8-table.pl for more characters
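Below is a minimal Python sketch of such a metadata comparison, assuming a SQL Server source and a Hive metastore backed by MySQL; the connection strings, database names, and the "patient" table are illustrative assumptions, and INFORMATION_SCHEMA is used in place of sp_help because its output is easier to consume programmatically.

```python
# Minimal sketch: pull column metadata from a SQL Server source and from the
# Hive metastore, then diff them. Connection details, database names, and the
# metastore living on MySQL are assumptions for illustration.
import pyodbc
import mysql.connector

def source_columns(table):
    # INFORMATION_SCHEMA instead of sp_help: same metadata, easier to consume
    conn = pyodbc.connect("DRIVER={ODBC Driver 17 for SQL Server};"
                          "SERVER=src-host;DATABASE=ehr;Trusted_Connection=yes")
    rows = conn.cursor().execute(
        "SELECT COLUMN_NAME, DATA_TYPE, CHARACTER_MAXIMUM_LENGTH "
        "FROM INFORMATION_SCHEMA.COLUMNS WHERE TABLE_NAME = ?", table).fetchall()
    return {r[0].lower(): (r[1], r[2]) for r in rows}

def lake_columns(table):
    # Hive metastore tables: TBLS -> SDS -> COLUMNS_V2
    conn = mysql.connector.connect(host="metastore-host", user="hive",
                                   password="hive_pw", database="metastore")
    cur = conn.cursor()
    cur.execute(
        "SELECT c.COLUMN_NAME, c.TYPE_NAME "
        "FROM TBLS t JOIN SDS s ON t.SD_ID = s.SD_ID "
        "JOIN COLUMNS_V2 c ON s.CD_ID = c.CD_ID "
        "WHERE t.TBL_NAME = %s", (table,))
    return {name.lower(): type_name for name, type_name in cur.fetchall()}

src, lake = source_columns("patient"), lake_columns("patient")
for col in sorted(set(src) | set(lake)):
    if col not in lake:
        print(f"{col}: missing in lake")
    elif col not in src:
        print(f"{col}: extra in lake")
    else:
        print(f"{col}: source={src[col]} lake={lake[col]}")
```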
Structured Data (3/9)
Data Masking Validations
▪ Source systems might have PHI/PII data in unmasked form unless the client has anonymized it beforehand
▪ The data masking implementation can be tested against the pre-agreed masking logic
▪ Masking logic can be written as a SQL query / Excel formula and its output compared with the data masked by the ETL code that has flowed into the data lake (see the sketch below)
▪ E.g., unmasked SSN 123-45-7891 needs to be masked as XXX-XX-7891
▪ E.g., unmasked email abc.def@xyz.com needs to be masked as axx.xxx@xyz.com
▪ Pay attention to unmasked data that does not come in the expected format, which causes masking logic to fail
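A minimal Python sketch of a reference implementation of the two masking rules above, used to cross-check what the ETL code wrote to the lake; the rules (keep the last four SSN digits, keep the first character of the email local part) are inferred from the examples on this slide, and the format check mirrors the failure mode called out in the last bullet.

```python
# Minimal sketch of the pre-agreed masking rules, for cross-checking ETL output.
import re

def mask_ssn(ssn: str) -> str:
    # Keep only the last 4 digits: 123-45-7891 -> XXX-XX-7891
    if not re.fullmatch(r"\d{3}-\d{2}-\d{4}", ssn):
        # unexpected formats are exactly what breaks masking logic in production
        raise ValueError(f"unexpected SSN format: {ssn!r}")
    return "XXX-XX-" + ssn[-4:]

def mask_email(email: str) -> str:
    # Keep the first character of the local part and the whole domain:
    # abc.def@xyz.com -> axx.xxx@xyz.com
    local, domain = email.split("@", 1)
    masked = local[0] + re.sub(r"[A-Za-z0-9]", "x", local[1:])
    return f"{masked}@{domain}"

assert mask_ssn("123-45-7891") == "XXX-XX-7891"
assert mask_email("abc.def@xyz.com") == "axx.xxx@xyz.com"
```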
Structured Data (4/9)
Data Reconciliation at Each Load Frequency
▪ Data reconciliation is a testing gate wherein the data loaded in the target is compared against the data in the source to ensure that no data is dropped/corrupted in the migration process
▪ Record count for each table between source and data lake:
• Initial load
• Incremental load
▪ Truncate load vs. append load:
• Truncate load: Data in the target table is truncated every time a new data feed is about to be loaded
• Append load: A new data feed is appended to the already existing data in the target table
▪ Data validation for each table between source and data lake: Data in each row and column of the source table is to be compared with the data lake
• When using MS Excel, the data must be batched to work with a handful of rows at a time
• To compare the entire dataset, a custom-built automation tool can be used
▪ Duplicates in the data lake: SQL queries can be used to identify whether any duplicates were introduced during the data lake load (see the sketch below)
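A minimal Python sketch of count reconciliation and a duplicate check, assuming pyodbc for the SQL Server source and impyla for the Impala/Hive lake; the table list and the MRN business key are hypothetical.

```python
# Minimal sketch: per-table count reconciliation between source and lake,
# plus a duplicate check on an assumed business key.
import pyodbc
from impala.dbapi import connect as impala_connect  # impyla package

SRC = pyodbc.connect("DRIVER={ODBC Driver 17 for SQL Server};"
                     "SERVER=src-host;DATABASE=ehr;Trusted_Connection=yes")
LAKE = impala_connect(host="impala-host", port=21050)

def count(cursor, table):
    cursor.execute(f"SELECT COUNT(*) FROM {table}")
    return cursor.fetchone()[0]

for table in ["patient", "encounter"]:  # hypothetical table list
    src_n = count(SRC.cursor(), f"dbo.{table}")
    lake_n = count(LAKE.cursor(), f"datalake.{table}")
    print(f"{table}: source={src_n} lake={lake_n} "
          f"{'OK' if src_n == lake_n else 'MISMATCH'}")

# Duplicate check on an assumed business key (MRN)
cur = LAKE.cursor()
cur.execute("SELECT mrn, COUNT(*) FROM datalake.patient "
            "GROUP BY mrn HAVING COUNT(*) > 1")
for mrn, n in cur.fetchall():
    print(f"duplicate MRN {mrn}: {n} rows")
```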
Structured Data (5/9)
ELT Framework Validation
▪ Logging of the source type informs whether the source data comes from a table or a flat file
▪ Logging of the source connection string (DB link, file path, etc.):
• Indicates the database connection string if source data is coming from a database table
• If the source is a flat file, this informs the landing location of the file and where to read the data from
▪ Generation of batch IDs on a fresh run, and on a rerun upon failure, helps identify the data loaded in each day's batch
▪ Flags such as the primary key flag, truncate load flag, critical table flag, upload-to-cloud flag, etc. help define the behavior of ELT jobs
▪ Logging of records processed, loaded, and rejected for each table shows the number of records extracted from the source and rejected by/loaded into the target data lake (a cross-check sketch follows below)
▪ Polling frequency, trigger check, email notification, etc. indicate the frequency at which to poll for incoming files/data, trigger the next batch, send notifications of batch status, and so on
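A minimal Python sketch of an audit-log consistency check for the record counts mentioned above. The audit table name (etl_audit) and its columns are hypothetical; real frameworks will differ, but the invariant processed = loaded + rejected is the point being tested.

```python
# Minimal sketch: find audit rows where processed != loaded + rejected.
import pyodbc

conn = pyodbc.connect("DRIVER={ODBC Driver 17 for SQL Server};"
                      "SERVER=etl-host;DATABASE=audit;Trusted_Connection=yes")
cur = conn.cursor()
cur.execute("""
    SELECT batch_id, table_name, records_processed, records_loaded, records_rejected
    FROM etl_audit
    WHERE records_processed <> records_loaded + records_rejected
""")
for batch_id, table_name, p, l, r in cur.fetchall():
    print(f"batch {batch_id} / {table_name}: processed={p} "
          f"but loaded={l} + rejected={r}")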
Structured Data (6/9)
On-Premise vs. On-Cloud Validation
▪ Additional testing is required when sources are hosted on-premise and the data lake is being created on the cloud
▪ Validate the data types supported by the on-cloud application (through which data analysts/scientists will query the data)
• E.g., Azure SQL Data Warehouse doesn't support the timestamp/text data types; source data needs to be cast to datetime2/varchar respectively
• E.g., Impala does not support the DOUBLE data type, which has to be converted to NUMERIC
▪ Validate user group access (who can see what type of data)
▪ Validate masked/unmasked views based on the type of user
▪ Validate attribute names with spaces/special characters/reserved keywords between on-premise and on-cloud
• E.g., the source attributes LOCATION and DIV are reserved keywords in Impala; hence, they must be renamed to LOC and DIVISION to preserve the meaning
▪ Validate external tables created on HDFS files published through ELT jobs (see the sketch below)
• E.g., validate whether the external table points to the correct location on HDFS where the files are being published by the ELT jobs
▪ Validate data consistency between on-premise and on-cloud
• E.g., use custom-built validation tools to compare each attribute
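A minimal Python sketch of the external-table location check, using impyla and Impala's DESCRIBE FORMATTED output; the table name and expected HDFS path are hypothetical.

```python
# Minimal sketch: confirm an external table points at the HDFS directory the
# ELT jobs publish to. DESCRIBE FORMATTED exposes the Location in its rows.
from impala.dbapi import connect

EXPECTED = {"datalake.patient": "hdfs://nameservice1/data/lake/patient"}

cur = connect(host="impala-host", port=21050).cursor()
for table, expected_path in EXPECTED.items():
    cur.execute(f"DESCRIBE FORMATTED {table}")
    rows = cur.fetchall()
    # The relevant row looks like ('Location:           ', 'hdfs://...', None)
    location = next(r[1].strip() for r in rows
                    if r[0] and r[0].strip().startswith("Location"))
    print(table, "OK" if location == expected_path else f"MISMATCH: {location}")
```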
Structured Data (7/9)
Data Quality and Standardization Validation
▪ Data in the data lake needs to be cleansed and standardized for better analysis
▪ The actual data isn't removed or updated, but flags (WARNING / INVALID) can highlight data quality
▪ Validate the various DQ/DS rules with SQL queries on the source data and compare the output with the DQ/DS tools (a sketch of the first two rules follows below)
▪ E.g., Null Check DQ rule on MRN (Medical Record Number): when data is processed through a DQ tool, all records with a NULL MRN are flagged as WARNING/INVALID, along with a remark column stating MRN IS NULL
▪ E.g., Missing Parent DQ rule on Encounter ID with respect to Patient: when an encounter doesn't have an associated patient in the patient table, the Encounter ID is flagged as WARNING/INVALID, along with a remark column stating PATIENT MISSING
▪ E.g., Race data standardization:
• Race data in the source with codes such as 1001, 1002 needs to be standardized with corresponding descriptions such as Hispanic, Asian, etc.
• Based on requirements, standardization can be applied on the reference table or on the transaction (data) tables as well
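A minimal Python sketch that independently derives the expected DQ flags with SQL on the source, for comparison against what the DQ tool produced; table and column names are hypothetical, and the two queries mirror the Null Check and Missing Parent rules above.

```python
# Minimal sketch: compute expected DQ-rule hits directly from the source.
import pyodbc

src = pyodbc.connect("DRIVER={ODBC Driver 17 for SQL Server};"
                     "SERVER=src-host;DATABASE=ehr;Trusted_Connection=yes").cursor()

# Null Check rule: every patient row with a NULL MRN should be flagged
src.execute("SELECT patient_id FROM dbo.patient WHERE mrn IS NULL")
expected_null_mrn = {row[0] for row in src.fetchall()}

# Missing Parent rule: encounters whose patient is absent from the patient table
src.execute("""SELECT e.encounter_id FROM dbo.encounter e
               LEFT JOIN dbo.patient p ON e.patient_id = p.patient_id
               WHERE p.patient_id IS NULL""")
expected_orphans = {row[0] for row in src.fetchall()}

# These expected sets would then be compared with the WARNING/INVALID flags and
# remark columns ('MRN IS NULL', 'PATIENT MISSING') written by the DQ tool.
print(len(expected_null_mrn), "expected Null Check hits")
print(len(expected_orphans), "expected Missing Parent hits")
```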
Structured Data (8/9)
Data Partitioning and Compaction
▪ When a data lake is being created on the cloud using the Hadoop Distributed File System (HDFS), it is preferable to store data in partitions (based on user-provided partition criteria) and use compaction in order to minimize repeated read operations from the underlying HDFS
▪ Validate whether the source data is getting partitioned appropriately while being stored in the data lake on HDFS, e.g., partitioning based on Encounter_Create_Date:
• This creates a folder structure in the output by year, with each file containing the encounters for that year
• Data retrieval is faster when analysts/scientists query on a specified date range, since the data is already stored in such partitions
▪ Compaction includes two areas (a PySpark sketch of both partitioning and compaction follows below):
• Merging multiple smaller file chunks into a predefined file size to avoid repeated read operations
• Converting text format into Parquet/ZIP format to achieve file size reduction
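A minimal PySpark sketch of the two behaviors under test: writing data partitioned by the year of Encounter_Create_Date, and compacting many small files into a few Parquet files. All paths, the header option, and the target file count are hypothetical.

```python
# Minimal sketch: partitioned write plus small-file compaction in PySpark.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("partition-compaction-check").getOrCreate()

df = spark.read.option("header", True).csv("hdfs:///data/landing/encounter/")

# Partitioning: one directory per year, e.g. .../encounter_year=2019/
(df.withColumn("encounter_year", F.year(F.to_date("Encounter_Create_Date")))
   .write.mode("overwrite")
   .partitionBy("encounter_year")
   .parquet("hdfs:///data/lake/encounter/"))

# Compaction: merge many small files into a handful of larger Parquet files
compacted = spark.read.parquet("hdfs:///data/lake/encounter/")
(compacted.coalesce(8)
          .write.mode("overwrite")
          .parquet("hdfs:///data/lake/encounter_compacted/"))
```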
Structured Data (9/9)
Data Partitioning and Compaction
▪ Validate the size of the files being published on HDFS by the ELT jobs, e.g., by logging into Impala (a size-check sketch follows below)
• E.g., ELT jobs produced .txt files on the cloud totaling 3.19 GB, which reduced to 516 MB after Parquet conversion
▪ Validate the merging of multiple smaller files into one or more large files
• E.g., a DQ-DS tool might produce multiple small files (based on storage availability on the underlying data nodes), which can be seen at the output location; a utility can be written to merge all these files into a single Parquet file
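A minimal Python sketch of a size and file-count check on the published directory, shelling out to the standard `hdfs dfs` CLI; the path is hypothetical.

```python
# Minimal sketch: total size and part-file count of a published HDFS directory.
import subprocess

path = "hdfs:///data/lake/encounter_compacted/"

du = subprocess.run(["hdfs", "dfs", "-du", "-s", path],
                    capture_output=True, text=True, check=True).stdout
size_bytes = int(du.split()[0])  # first token of -du -s output is the size
print(f"total size: {size_bytes / 1024**2:.0f} MB")

# Count the part files to confirm small-file merging actually happened
ls = subprocess.run(["hdfs", "dfs", "-ls", path],
                    capture_output=True, text=True, check=True).stdout
part_files = [line for line in ls.splitlines() if "/part-" in line]
print(f"{len(part_files)} part files")
```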
Structured Data – Challenges
Test Strategy
▪ ELT jobs are unable to manage continuously varying data with special characters
▪ Delimiters arriving as part of incoming data
▪ PHI data in a format different from the test data may cause masking logic to fail
▪ Backdated inserts not captured in incremental runs
Tools and Technology
▪ Limitations on the data types, reserved keywords, and special characters handled by cloud applications
▪ Date format conversion based on the time zone selected during installation
▪ Data retrieval challenges and the cost involved in downloading data for analysis
Configuration / Environment
▪ Late-arriving flat files
▪ Any service breakdown in production causes parts of the end-to-end workflow to break
Others
▪ Project timelines need to accommodate any unknowns in the data during production deployment and/or a few weeks after deployment
Semi-structured Data (1/6)
Flow Diagram of Semi-Structured (JSON) Message Ingestion into Data Lake
[diagram not reproduced in the text extract]
Semi-structured Data (2/6)
Testing Gates
▪ Data lake creation and testing for semi-structured data can take place in the following areas:
• JSON Message Validation
• Data Reconciliation
• ELT Framework (Extract, Load and Transform)
• Data Quality and Standardization Validations
Semi-structured Data (3/6)
JSON Message Validation
▪ The data lake can be integrated with a Kafka messaging system that produces JSON messages in semi-structured format
▪ As part of JSON message validation, the QA team can cover the following pointers:
• Compare the JSON message with the JSON schema provided as part of the requirements (see the sketch after this example)
• Data type check
• Null / Not Null constraints check
▪ For instance:

JSON Schema:
{
  "ServiceLevel": { "type": ["string", "null"] },
  "ServiceType": { "type": ["string"] }
}

JSON Message:
{
  "ServiceLevel": "One",
  "ServiceType": "Skilled Nurse"
}
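A minimal Python sketch of automating this check with the jsonschema package. The fragment above is wrapped in a full object schema ("properties", "required") as an assumption, since the slide shows only the per-field types.

```python
# Minimal sketch: validate incoming JSON messages against the agreed schema.
import jsonschema

schema = {
    "type": "object",
    "properties": {
        "ServiceLevel": {"type": ["string", "null"]},
        "ServiceType": {"type": ["string"]},
    },
    "required": ["ServiceType"],  # assumption: non-nullable fields are required
}

message = {"ServiceLevel": "One", "ServiceType": "Skilled Nurse"}
jsonschema.validate(instance=message, schema=schema)  # raises on failure

# A message violating the Not Null constraint is rejected:
try:
    jsonschema.validate(instance={"ServiceLevel": None, "ServiceType": None},
                        schema=schema)
except jsonschema.ValidationError as err:
    print("rejected:", err.message)
```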
Semi-structured Data (4/6)
Data Reconciliation
▪ Data reconciliation is a testing gate wherein the data loaded in the target tables is compared against the data in the source JSON messages, to ensure no data is trimmed/corrupted/missed in the migration process. One can use the following strategies:
▪ Record count for each table between source and data lake:
• Simple JSON: A single JSON message ingested into the lake is loaded as a single row in the target table
• Complex JSON: More than one row is loaded in the target table, depending on the level of hierarchy and nesting present in the JSON message
▪ Data validation for each table between source and data lake; data in each row and column of the JSON message is to be compared with the data lake:
• Use the OPENJSON function in SQL Server to parse the JSON messages and convert them into structured format (see the sketch below)
• Compare the parsed output of OPENJSON with the data loaded in the target tables using Python
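A minimal Python sketch of this approach: shred a source JSON message with SQL Server's OPENJSON, then compare the rows with the target table. The columns reuse the ServiceLevel/ServiceType example; the target table name is hypothetical.

```python
# Minimal sketch: OPENJSON-parsed source rows vs. rows loaded in the target.
import pyodbc

conn = pyodbc.connect("DRIVER={ODBC Driver 17 for SQL Server};"
                      "SERVER=qa-host;DATABASE=staging;Trusted_Connection=yes")
cur = conn.cursor()

message = '{"ServiceLevel": "One", "ServiceType": "Skilled Nurse"}'
cur.execute("""
    SELECT ServiceLevel, ServiceType
    FROM OPENJSON(?)
    WITH (ServiceLevel nvarchar(50) '$.ServiceLevel',
          ServiceType  nvarchar(50) '$.ServiceType')
""", message)
expected = set(map(tuple, cur.fetchall()))

cur.execute("SELECT ServiceLevel, ServiceType FROM dbo.service_target")  # hypothetical
actual = set(map(tuple, cur.fetchall()))

print("missing in target:", expected - actual)
print("unexpected in target:", actual - expected)
```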
Semi-structured Data (5/6)
ELT Framework Validation
▪ Raw layer validation on HDFS: Indicates whether the source JSON messages ingested through Kafka are loaded into HDFS in raw form, before the processing tool picks the messages up and loads them into the target tables
▪ Logging of Kafka details: Informs about the Kafka topic, partition, offset, and hostname (see the sketch below)
▪ Generation of data lake UIDs helps in identifying JSON messages
▪ Logging of records processed, loaded, and rejected as part of each JSON ingestion shows the number of records ingested through Kafka, processed, and failed in JSON schema validation, with error logging
▪ Email notification: Shows the JSONs ingested through Kafka on an hourly basis, the JSON count loaded in the raw layer, and a daily report with JSON-wise counts, failures, and successful ingestions
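A minimal Python sketch using kafka-python to pull the topic/partition/offset details for recent messages, so they can be cross-checked against the framework's audit logging; the broker address, topic name, and consumer group are assumptions.

```python
# Minimal sketch: read Kafka message metadata for comparison with audit logs.
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "json-ingest",                      # hypothetical topic
    bootstrap_servers=["broker:9092"],
    group_id="qa-reconciliation",
    auto_offset_reset="earliest",
    consumer_timeout_ms=10_000,         # stop iterating when the topic goes quiet
)

for record in consumer:
    # These fields should match what the ELT framework logs per message
    print(record.topic, record.partition, record.offset, len(record.value))
```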
Semi-structured Data (6/6)
Data Quality and Standardization Validation
▪ The data in the data lake needs to be cleansed and standardized for better analysis
▪ In this case, the actual data need not be removed or updated, but both valid and erroneous data are logged into the audit tables
▪ Validate JSON messages against the JSON schema
• E.g., Null Check: If a non-nullable attribute is assigned a null value in the input JSON message, ingestion of that message fails, and the error gets logged in the audit table with the non-nullable attribute details
Semi-structured Data – Challenges
Test Strategy
▪ ELT jobs are unable to manage continuously varying data with special characters
▪ Test data preparation as per test scenarios
▪ Manual validation of a single JSON loaded into multiple tables
▪ Live reconciliation of messages produced through Kafka, due to continuous streaming
Tools and Technology
▪ Limitations on reserved keywords and special character handling
Configuration / Environment
▪ Multiple services running simultaneously on the cluster can result in choking of JSON messages in the QA environment
▪ Any service breakdown in production causes parts of the end-to-end workflow to break
Others
▪ Project timelines need to accommodate any unknowns in the data during production deployment and/or a few weeks after deployment
References
▪ https://www.cloudmoyo.com/blog/difference-between-a-data-warehouse-and-a-data-lake/
▪ https://utf8-chartable.de/unicode-utf8-table.pl
▪ https://www.cigniti.com/blog/5-big-data-testing-challenges/
▪ https://www.confluent.io/blog/the-value-of-apache-kafka-in-big-data-ecosystem/
▪ https://kafka.apache.org/
▪ https://dzone.com/articles/json-drivers-parsing-hierarchical-data
About CitiusTech
3,500+
Healthcare IT professionals worldwide
1,500+
Healthcare software engineering
800+
HL7 certified professionals
30%+
CAGR over last 5 years
110+
Healthcare customers
▪ Healthcare technology companies
▪ Hospitals, IDNs & medical groups
▪ Payers and health plans
▪ ACO, MCO, HIE, HIX, NHIN and RHIO
▪ Pharma & Life Sciences companies
Thank You
Authors:
Vaibhav Shahane
Vaibhavi Indap
Technical Lead
thoughtleaders@citiustech.com