This document is confidential and contains proprietary information, including trade secrets of CitiusTech. Neither the document nor any of the information
contained in it may be reproduced or disclosed to any unauthorized person under any circumstances without the express written permission of CitiusTech.
Testing Strategies for Data Lake
Hosted on Hadoop
August 2019 | Authors: Vaibhav Shahane and Vaibhavi Indap
CitiusTech Thought Leadership
Agenda
▪ Overview
▪ Data Lake v/s Data Warehouse
▪ Structured Data
▪ Semi-structured Data
▪ References
Overview
▪ A data lake is a repository to store massive amounts of data in its native
form
▪ Businesses have disparate sources of data that are difficult to
analyze unless brought together on a single platform (a common pool
of data)
▪ A data lake allows business decision makers, data analysts and data
scientists to get a holistic view of data coming in from heterogeneous
sources
Data Lake v/s Data Warehouse
Similarities
▪ Data lakes maintain heterogeneous sources in a single pool
▪ They provide better access to enterprise-wide data for analysts and data scientists
Differences
▪ Data is highly organized and structured in data warehouses, whereas a data lake uses a flat structure and preserves the original schema
▪ Data present in data warehouses is transformed, aggregated and may lose its original schema
▪ Data warehouses provide transactional solutions by enabling analysts to drill down/up/through specific areas of the business
▪ Data lakes answer questions that aren't structured but need discovery using iterative algorithms and/or complex mathematical functions
Structured Data (1/9)
Testing Gates
▪ Data lake creation and testing can be organized around the following areas:
• Schema validations
• Data Masking validations
• Data Reconciliation at each load frequency
• ELT Framework (Extract, Load and Transform)
• On-premise vs. on-cloud validations (in case the data lake is hosted on cloud)
• Data Quality and Standardization validations
• Data partitioning and compaction
Structured Data (2/9)
Schema Validation
▪ Data from heterogeneous sources is present in the data lake, and the table schema of each
table, as defined in the source, must be preserved
▪ As a part of schema validation, QA team can cover the following pointers:
• Data type
• Data length
• Null/Not Null constraints
• Delimiters (pay attention to delimiters coming as part of data)
• Special characters (visible and invisible). For e.g., hex code 'c2 ad' is a soft hyphen that
appears as white space and does not go away when the TRIM function is applied during data
comparison
▪ Source metadata and metadata from data lake can be extracted from respective metastores and
compared
• If the source is on SQL Server, metadata can be retrieved using the sp_help stored procedure (e.g., from SSMS)
• If the data lake is on HDFS, Hive keeps its metadata in metastore tables such as DBS, TBLS,
COLUMNS_V2, SDS, etc.
▪ Visit https://utf8-chartable.de/unicode-utf8-table.pl for more characters
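The metadata comparison above can be sketched in Python; the schema dictionaries, the type-equivalence table and the helper function below are illustrative assumptions, not the actual output of sp_help or the Hive metastore:

```python
# Minimal sketch: compare source schema vs. data-lake schema, assuming both
# have already been extracted (e.g., via sp_help and Hive metastore queries).
source_schema = {
    "patient_id": {"type": "int", "length": 4, "nullable": False},
    "last_name":  {"type": "varchar", "length": 50, "nullable": True},
}
lake_schema = {
    "patient_id": {"type": "int", "length": 4, "nullable": False},
    "last_name":  {"type": "string", "length": 50, "nullable": True},  # Hive uses STRING
}

# Pairs of types treated as equivalent between the source RDBMS and Hive (assumption).
TYPE_EQUIV = {("varchar", "string"), ("datetime", "timestamp")}

def schema_mismatches(src, lake):
    """Return a list of (column, issue) pairs found while comparing schemas."""
    issues = []
    for col, meta in src.items():
        if col not in lake:
            issues.append((col, "missing in data lake"))
            continue
        lmeta = lake[col]
        if meta["type"] != lmeta["type"] and (meta["type"], lmeta["type"]) not in TYPE_EQUIV:
            issues.append((col, f"type {meta['type']} vs {lmeta['type']}"))
        if meta["nullable"] != lmeta["nullable"]:
            issues.append((col, "nullability differs"))
    return issues

print(schema_mismatches(source_schema, lake_schema))  # → [] (varchar/string equivalent)

# Invisible characters: a soft hyphen (U+00AD) survives strip(), so a naive
# trimmed comparison fails even though both values look identical.
assert "ABC\u00ad".strip() != "ABC"
```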
Structured Data (3/9)
Data Masking Validations
▪ Source systems might have PHI/PII data in unmasked form unless the client has anonymized it
beforehand
▪ Data masking logic implementation can be tested based on pre-agreed masking logic
▪ Masking logic can be written in SQL query / Excel formula and output compared with the data
masked by ETL code which has flowed into the data lake
▪ E.g. Unmasked SSN 123-45-7891 needs to be masked as XXX-XX-7891
▪ E.g. Unmasked email abc.def@xyz.com needs to be masked as axx.xxx@xyz.com
▪ Pay attention to unmasked data that does not arrive in the expected format, as this can cause
the masking logic to fail
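A minimal sketch of the masking checks above, assuming the SSN and email rules shown on this slide; the function names and regex are illustrative, not the actual ETL implementation:

```python
import re

def mask_ssn(ssn: str) -> str:
    """Mask an SSN in NNN-NN-NNNN form, keeping only the last four digits."""
    m = re.fullmatch(r"(\d{3})-(\d{2})-(\d{4})", ssn)
    if not m:
        # Unexpected formats should fail loudly rather than mask incorrectly
        raise ValueError(f"unexpected SSN format: {ssn!r}")
    return f"XXX-XX-{m.group(3)}"

def mask_email(email: str) -> str:
    """Keep the first character of the local part; mask the rest with 'x'."""
    local, domain = email.split("@", 1)
    masked = local[0] + "".join("x" if c.isalnum() else c for c in local[1:])
    return f"{masked}@{domain}"

print(mask_ssn("123-45-7891"))        # → XXX-XX-7891
print(mask_email("abc.def@xyz.com"))  # → axx.xxx@xyz.com
```

The expected output of the masking SQL/Excel logic can then be compared row by row with the data masked by the ETL code in the lake.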
Structured Data (4/9)
Data Reconciliation at Each Load Frequency
▪ Data Reconciliation is a testing gate wherein the data which is loaded in target is compared
against data in the source to ensure that no data is dropped/corrupted in the migration process
▪ Record count for each table between source and data lake:
• Initial load
• Incremental load
▪ Truncate load v/s. Append load:
• Truncate load: Data in target table is truncated every time a new data feed is about to be
loaded
• Append load: A new data feed is appended to already existing data in target table
▪ Data validation for each table between source and data lake: data in each row and column of the
source table is compared with the data lake
• When using MS Excel, data needs to be batched to get a manageable amount of data
• To compare the entire dataset, a custom-built automation tool can be used
▪ Duplicates in data lake: SQL queries can be used to identify whether any duplicates were
introduced during the data lake load
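The count, duplicate and row-level checks above can be sketched as follows, assuming rows from both sides have already been extracted as lists of tuples (the sample data is illustrative):

```python
from collections import Counter

# Illustrative extracts: same three source rows, with one duplicate introduced
# into the lake by the load process.
source_rows = [(1, "A"), (2, "B"), (3, "C")]
lake_rows   = [(1, "A"), (2, "B"), (3, "C"), (3, "C")]

def reconcile(src, lake):
    """Compare record counts, find duplicates in the lake, and find dropped rows."""
    return {
        "source_count": len(src),
        "lake_count": len(lake),
        "counts_match": len(src) == len(lake),
        # Rows loaded more than once into the lake
        "duplicates": [row for row, n in Counter(lake).items() if n > 1],
        # Rows present in source but missing from the lake
        "missing": [row for row in src if row not in set(lake)],
    }

print(reconcile(source_rows, lake_rows))
# → counts_match False, duplicates [(3, 'C')], missing []
```

The same check can be run after both initial and incremental loads, with the lake-side extract scoped to the batch being reconciled.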
Structured Data (5/9)
ELT Framework Validation
▪ Logging of the source type informs whether the source data comes from a table or a flat file
▪ Logging of source connection string (DB link, file path, etc.):
• Indicates the database connection string if source data is coming from database table
• If source is a flat file, this informs the landing location of the file and where to read the data
from
▪ Generation of batch IDs on a fresh run and on rerun upon failure helps identify the data loaded
in a batch every day
▪ Flags such as primary key flag, truncate load flag, critical table flag, upload to cloud flag, etc. help
define the behavior of ELT jobs
▪ Logging of records processed, loaded and rejected for each table shows the number of records
extracted from source and rejected/loaded into the target data lake
▪ Polling frequency, trigger check, email notification etc. indicate the frequency to poll for
incoming file/data, trigger next batch, send notifications of batch status, etc.
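A hypothetical shape for the ELT audit log described above, with the sanity check QA can apply to it; all field names and values here are assumptions, not the actual framework's log schema:

```python
# One audit-log entry per table per batch (illustrative field names).
batch_log = {
    "batch_id": 20190801,
    "source_type": "flat_file",               # table vs. flat file
    "source_connection": "/landing/claims/2019-08-01.txt",
    "truncate_load": False,                   # behavior flags
    "records_processed": 1000,
    "records_loaded": 990,
    "records_rejected": 10,
}

# Processed records must equal loaded + rejected; otherwise rows were
# silently dropped somewhere in the ELT pipeline.
assert batch_log["records_processed"] == (
    batch_log["records_loaded"] + batch_log["records_rejected"]
)
print("audit counts reconcile")
```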
Structured Data (6/9)
On-Premise Vs. On-Cloud Validation
▪ Additional testing is required when sources are hosted on-premise and data lake is being created
on cloud
▪ Validate the data types supported by the on-cloud application (through which data analysts/scientists will
be querying data)
• For e.g., Azure SQL Data Warehouse doesn't support the timestamp/text data types. One needs to
cast source data to datetime2/varchar respectively
• For e.g., Impala does not support DOUBLE data type which has to be converted to NUMERIC
▪ Validate user group access (who can see what type of data)
▪ Validate masked/unmasked views based on type of users
▪ Validate attribute names with spaces/special characters/reserved keywords between on-premise
and on-cloud
• For e.g., source attributes named LOCATION and DIV are reserved keywords in Impala; hence, they
must be changed to LOC and DIVISION to preserve the meaning
▪ Validate external tables created on HDFS files published through ELT jobs
• For e.g., Validate whether the external table is pointing to correct location on HDFS where
the files are being published by ETL jobs
▪ Validate the data consistency between on-premise and on-cloud
• For e.g., Use custom built validation tools to compare each attribute
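The data-type and reserved-keyword validations above can be sketched as a mapping check; both lookup tables below are assumptions based only on the examples on this slide:

```python
# Expected on-premise → on-cloud conversions (illustrative, from the
# Azure SQL DW and Impala examples above).
TYPE_MAP = {"timestamp": "datetime2", "text": "varchar"}
RESERVED_RENAMES = {"LOCATION": "LOC", "DIV": "DIVISION"}

def expected_cloud_column(name, dtype):
    """Return the (name, type) a source column should carry on the cloud side."""
    return RESERVED_RENAMES.get(name, name), TYPE_MAP.get(dtype, dtype)

print(expected_cloud_column("LOCATION", "timestamp"))  # → ('LOC', 'datetime2')
print(expected_cloud_column("MRN", "int"))             # → ('MRN', 'int')
```

QA can generate the expected cloud-side schema this way and diff it against what the ELT jobs actually created.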
Structured Data (7/9)
Data Quality and Standardization Validation
▪ Data in data lake needs to be cleansed and standardized for better analysis
▪ Actual data isn't removed or updated; instead, flags (Warning / Invalid) highlight data quality issues
▪ Validate various DQ/DS rules by running SQL queries on source data and comparing the output with the
DQ/DS tool's output
▪ For e.g., Null Check DQ rule on MRN (Medical Record Number): When data is processed through
a tool for DQ, all the records with NULL MRN are flagged as WARNING/INVALID along with a
remark column that states MRN is NULL
▪ For e.g., Missing Parent DQ rule on Encounter ID with respect to Patient: When an encounter
doesn't have an associated patient in the patient table, the Encounter ID is flagged as WARNING/INVALID
along with a remark column that states PATIENT MISSING
▪ For e.g., Race Data Standardization:
• Race data in source with codes such as 1001, 1002 needs to be standardized with the
corresponding descriptions, such as Hispanic, Asian, etc.
• Based on requirements, standardization can be achieved on reference table or transaction
(data) tables as well
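A minimal sketch of the Null Check DQ rule described above; the record layout and the flag/remark values are illustrative assumptions mirroring the slide's example:

```python
# Sample records; data itself is never removed or updated by the DQ step.
records = [
    {"mrn": "12345", "encounter_id": "E1"},
    {"mrn": None,    "encounter_id": "E2"},
]

def apply_null_check(rows, field, remark):
    """Flag rows where `field` is NULL, adding flag and remark columns only."""
    for row in rows:
        if row[field] is None:
            row["dq_flag"] = "INVALID"
            row["dq_remark"] = remark
    return rows

flagged = apply_null_check(records, "mrn", "MRN is NULL")
print([r.get("dq_flag") for r in flagged])  # → [None, 'INVALID']
```

The equivalent SQL query on the source can then be compared against the DQ tool's flagged output.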
Structured Data (8/9)
Data Partitioning and Compaction
▪ When a data lake is created on cloud using the Hadoop file system (HDFS), it is preferable to
store data in partitions (based on user-provided partition criteria) and use compaction in order
to minimize multiple read operations on the underlying HDFS
▪ Validate whether the data in source is getting partitioned appropriately while being stored in a
data lake on HDFS. For e.g., Partitioning based on Encounter_Create_Date:
• This will create a folder structure in the output by year and will contain encounters in the
file partitioned by year
• Data retrieval will be faster when analysts/scientists query on specified date range since
data is already stored in such partitions
▪ Compaction includes two areas:
• Merging of multiple smaller file chunks into a predefined file size to avoid multiple read
operations
• Converting text format into Parquet/ZIP format to achieve file size reduction
Structured Data (9/9)
Data Partitioning and Compaction
▪ Validate the size of the files published on HDFS by ELT jobs by logging into Impala
• For e.g., ELT jobs produced .txt files on cloud with a total size of 3.19 GB, which
reduced to 516 MB after Parquet conversion
▪ Validate merging of multiple smaller files into one or more large files
• For e.g., the DQ-DS tool might produce multiple small files (based on storage availability
on the underlying data nodes) at the output location. A utility can be written
to merge all these files into a single file in Parquet format
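The small-file merging idea above can be illustrated with a toy compaction planner; the chunk sizes and target threshold are illustrative, not actual HDFS block settings:

```python
# Greedily group small file chunks (sizes in MB) so each merged file
# stays close to a predefined target size.
def plan_compaction(chunk_sizes_mb, target_mb=128):
    merged, current, current_size = [], [], 0
    for size in chunk_sizes_mb:
        if current and current_size + size > target_mb:
            merged.append(current)      # close out the current merged file
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        merged.append(current)
    return merged

print(plan_compaction([40, 50, 60, 30, 90, 10]))  # → [[40, 50], [60, 30], [90, 10]]
```

Validation then checks that the merged files on HDFS match the expected grouping and sizes.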
Structured Data – Challenges
Test Strategy
▪ ELT jobs are unable to manage continuously varying data with special characters
▪ Delimiters arriving as part of incoming data
▪ PHI data in a format different from test data may cause masking logic to fail
▪ Backdated inserts not captured in incremental runs
Tools and Technology
▪ Limitations on data types, reserved keywords and special characters handled by cloud applications
▪ Date format conversion based on the time zone selected during installation
▪ Data retrieval challenges and the cost involved in downloading data for analysis
Configuration / Environment
▪ Late-arriving flat files
▪ Any service breakdown in production causes parts of the end-to-end workflow to break
Others
▪ Project timelines need to accommodate any unknowns in the data during production deployment and/or a few weeks after deployment
Semi-structured Data (1/6)
Flow Diagram Of Semi-Structured (JSON) Message Ingestion into Data lake
Semi-structured Data (2/6)
Testing Gates
▪ Data lake creation and testing for semi-structured data can take place in the following areas:
• JSON Message Validation
• Data Reconciliation
• ELT Framework (Extract, Load and Transform)
• Data Quality and Standardization Validations
Semi-structured Data (3/6)
JSON Message Validation
▪ The data lake can be integrated with the Kafka messaging system, which produces JSON messages in
semi-structured format
▪ As a part of JSON message validation, QA team can cover the following pointers:
• Compare the JSON Message with the JSON schema provided as part of requirement
• Data Type Check
• Null / Not Null constraints Check
▪ For instance:
JSON Schema:
{
  "ServiceLevel": { "type": ["string", "null"] },
  "ServiceType": { "type": ["string"] }
}
JSON Message:
{
  "ServiceLevel": "One",
  "ServiceType": "Skilled Nurse"
}
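The schema checks above can be sketched with the standard library (a real project might instead use a package such as jsonschema); the validator below is an illustrative assumption:

```python
import json

# Schema from the slide: ServiceLevel is nullable, ServiceType is not.
schema = {
    "ServiceLevel": {"type": ["string", "null"]},
    "ServiceType":  {"type": ["string"]},
}
TYPE_CHECKS = {"string": lambda v: isinstance(v, str), "null": lambda v: v is None}

def validate(message, schema):
    """Return a list of validation errors (empty list means the message passes)."""
    errors = []
    for field, rule in schema.items():
        if field not in message:
            errors.append(f"{field}: missing")
            continue
        if not any(TYPE_CHECKS[t](message[field]) for t in rule["type"]):
            errors.append(f"{field}: expected {rule['type']}")
    return errors

msg = json.loads('{"ServiceLevel": "One", "ServiceType": "Skilled Nurse"}')
print(validate(msg, schema))                      # → []
print(validate({"ServiceLevel": None}, schema))   # → ['ServiceType: missing']
```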
Semi-structured Data (4/6)
Data Reconciliation
▪ Data Reconciliation is a testing gate wherein the data loaded in target tables is compared against
the data in the source JSON messages to ensure no data is trimmed/corrupted/missed in the
migration process. One can use the following strategies to ensure the same:
▪ Record count for each table between source and data lake
• Simple JSON: Single JSON message ingested in lake is loaded as single row in target table
• Complex JSON: More than one row is loaded in the target table depending on the level of
hierarchy and nesting present in the JSON message
▪ Data validation for each table between source and data lake. Data in each row and column in
JSON message to be compared with data lake
• Use the OPENJSON function in SQL Server to parse the JSON messages and convert them
into structured format
• Compare the parsed output of OPENJSON with the data loaded in target tables using
Python
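The complex-JSON count rule above can be sketched as follows; the message shape and target row layout are illustrative assumptions:

```python
# A nested (complex) JSON message expands to one target row per child element.
message = {
    "encounter_id": "E1",
    "diagnoses": [{"code": "I10"}, {"code": "E11"}],  # nested array → multiple rows
}

def flatten(msg):
    """Explode one nested JSON message into the rows expected in the target table."""
    return [
        {"encounter_id": msg["encounter_id"], "diagnosis_code": d["code"]}
        for d in msg["diagnoses"]
    ]

rows = flatten(message)
# Expected target row count equals the number of nested elements.
assert len(rows) == len(message["diagnoses"])
print(rows)
```

A simple JSON with no nesting would flatten to exactly one row, matching the single-row rule above.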
Semi-structured Data (5/6)
ELT Framework Validation
▪ Raw layer validation on HDFS: Indicates whether the source JSON messages ingested through
Kafka are loaded into HDFS in raw form, before the processing tool picks up the messages and loads
them into the target tables
▪ Logging of Kafka details: Informs about Kafka topic, partition, offset, and hostname
▪ Generation of data lake UIDs helps in identifying JSON messages
▪ Logging of records processed, loaded and rejected as part of each JSON ingestion shows the
number of records ingested through Kafka, processed, and failed in JSON schema validation, with
error logging
▪ Email notification: Shows JSONs ingested through Kafka on an hourly basis, the JSON count loaded
in the raw layer, and a daily report with JSON-wise counts, failures and successful ingestions
Semi-structured Data (6/6)
Data Quality and Standardization Validation
▪ The data in data lake needs to be cleansed and standardized for better analysis
▪ In this case, the actual data need not be removed or updated but both valid and erroneous data
are logged into the audit tables
▪ Validate JSON messages against JSON schema
• For e.g., Null Check: If a non-nullable attribute is assigned a null value in the input JSON
message, ingestion of that message fails and the error is logged in the audit table with the
non-nullable attribute's details
Semi-structured Data – Challenges
Test Strategy
▪ ELT jobs are unable to manage continuously varying data with special characters
▪ Test data preparation as per test scenarios
▪ Manual validation of a single JSON loaded into multiple tables
▪ Live reconciliation of messages produced through Kafka due to continuous streaming
Tools and Technology
▪ Limitations on reserved keywords and special character handling
Configuration / Environment
▪ Multiple services running simultaneously on the cluster result in choking of JSON messages in the QA environment
▪ Any service breakdown in production causes parts of the end-to-end workflow to break
Others
▪ Project timelines need to accommodate any unknowns in the data during production deployment and/or a few weeks after deployment
References
▪ https://www.cloudmoyo.com/blog/difference-between-a-data-warehouse-and-a-data-lake/
▪ https://utf8-chartable.de/unicode-utf8-table.pl
▪ https://www.cigniti.com/blog/5-big-data-testing-challenges/
▪ https://www.confluent.io/blog/the-value-of-apache-kafka-in-big-data-ecosystem/
▪ https://kafka.apache.org/
▪ https://dzone.com/articles/json-drivers-parsing-hierarchical-data
About CitiusTech
▪ 3,500+ Healthcare IT professionals worldwide
▪ 1,500+ Healthcare software engineering
▪ 800+ HL7 certified professionals
▪ 30%+ CAGR over last 5 years
▪ 110+ Healthcare customers
▪ Healthcare technology companies
▪ Hospitals, IDNs & medical groups
▪ Payers and health plans
▪ ACO, MCO, HIE, HIX, NHIN and RHIO
▪ Pharma & Life Sciences companies
Thank You
Authors:
Vaibhav Shahane
Vaibhavi Indap
Technical Lead
thoughtleaders@citiustech.com
6 Epilepsy Use Cases for NLP
CitiusTech
 
Opioid Epidemic - Causes, Impact and Future
Opioid Epidemic - Causes, Impact and FutureOpioid Epidemic - Causes, Impact and Future
Opioid Epidemic - Causes, Impact and Future
CitiusTech
 
Rising Importance of Health Economics & Outcomes Research
Rising Importance of Health Economics & Outcomes ResearchRising Importance of Health Economics & Outcomes Research
Rising Importance of Health Economics & Outcomes Research
CitiusTech
 
ICD 11: Impact on Payer Market
ICD 11: Impact on Payer MarketICD 11: Impact on Payer Market
ICD 11: Impact on Payer Market
CitiusTech
 
Driving Home Health Efficiency through Data Analytics
Driving Home Health Efficiency through Data AnalyticsDriving Home Health Efficiency through Data Analytics
Driving Home Health Efficiency through Data Analytics
CitiusTech
 
Poster Presentation - FDA Compliance Landscape & What it Means to Your AI Asp...
Poster Presentation - FDA Compliance Landscape & What it Means to Your AI Asp...Poster Presentation - FDA Compliance Landscape & What it Means to Your AI Asp...
Poster Presentation - FDA Compliance Landscape & What it Means to Your AI Asp...
CitiusTech
 

More from CitiusTech (20)

Member Engagement Using Sentiment Analysis for Health Plans
Member Engagement Using Sentiment Analysis for Health PlansMember Engagement Using Sentiment Analysis for Health Plans
Member Engagement Using Sentiment Analysis for Health Plans
 
Evolving Role of Digital Biomarkers in Healthcare
Evolving Role of Digital Biomarkers in HealthcareEvolving Role of Digital Biomarkers in Healthcare
Evolving Role of Digital Biomarkers in Healthcare
 
Virtual Care: Key Challenges & Opportunities for Payer Organizations
Virtual Care: Key Challenges & Opportunities for Payer Organizations Virtual Care: Key Challenges & Opportunities for Payer Organizations
Virtual Care: Key Challenges & Opportunities for Payer Organizations
 
Provider-led Health Plans (Payviders)
Provider-led Health Plans (Payviders)Provider-led Health Plans (Payviders)
Provider-led Health Plans (Payviders)
 
CMS Medicare Advantage 2021 Star Ratings: An Analysis
CMS Medicare Advantage 2021 Star Ratings: An AnalysisCMS Medicare Advantage 2021 Star Ratings: An Analysis
CMS Medicare Advantage 2021 Star Ratings: An Analysis
 
Accelerate Healthcare Technology Modernization with Containerization and DevOps
Accelerate Healthcare Technology Modernization with Containerization and DevOpsAccelerate Healthcare Technology Modernization with Containerization and DevOps
Accelerate Healthcare Technology Modernization with Containerization and DevOps
 
FHIR for Life Sciences
FHIR for Life SciencesFHIR for Life Sciences
FHIR for Life Sciences
 
Leveraging Analytics to Identify High Risk Patients
Leveraging Analytics to Identify High Risk PatientsLeveraging Analytics to Identify High Risk Patients
Leveraging Analytics to Identify High Risk Patients
 
FHIR Adoption Framework for Payers
FHIR Adoption Framework for PayersFHIR Adoption Framework for Payers
FHIR Adoption Framework for Payers
 
Payer-Provider Engagement
Payer-Provider Engagement Payer-Provider Engagement
Payer-Provider Engagement
 
COVID19: Impact & Mitigation Strategies for Payer Quality Improvement 2021
COVID19: Impact & Mitigation Strategies for Payer Quality Improvement 2021COVID19: Impact & Mitigation Strategies for Payer Quality Improvement 2021
COVID19: Impact & Mitigation Strategies for Payer Quality Improvement 2021
 
Demystifying Robotic Process Automation (RPA) & Automation Testing
Demystifying Robotic Process Automation (RPA) & Automation TestingDemystifying Robotic Process Automation (RPA) & Automation Testing
Demystifying Robotic Process Automation (RPA) & Automation Testing
 
Progressive Web Apps in Healthcare
Progressive Web Apps in HealthcareProgressive Web Apps in Healthcare
Progressive Web Apps in Healthcare
 
RPA in Healthcare
RPA in HealthcareRPA in Healthcare
RPA in Healthcare
 
6 Epilepsy Use Cases for NLP
6 Epilepsy Use Cases for NLP6 Epilepsy Use Cases for NLP
6 Epilepsy Use Cases for NLP
 
Opioid Epidemic - Causes, Impact and Future
Opioid Epidemic - Causes, Impact and FutureOpioid Epidemic - Causes, Impact and Future
Opioid Epidemic - Causes, Impact and Future
 
Rising Importance of Health Economics & Outcomes Research
Rising Importance of Health Economics & Outcomes ResearchRising Importance of Health Economics & Outcomes Research
Rising Importance of Health Economics & Outcomes Research
 
ICD 11: Impact on Payer Market
ICD 11: Impact on Payer MarketICD 11: Impact on Payer Market
ICD 11: Impact on Payer Market
 
Driving Home Health Efficiency through Data Analytics
Driving Home Health Efficiency through Data AnalyticsDriving Home Health Efficiency through Data Analytics
Driving Home Health Efficiency through Data Analytics
 
Poster Presentation - FDA Compliance Landscape & What it Means to Your AI Asp...
Poster Presentation - FDA Compliance Landscape & What it Means to Your AI Asp...Poster Presentation - FDA Compliance Landscape & What it Means to Your AI Asp...
Poster Presentation - FDA Compliance Landscape & What it Means to Your AI Asp...
 

Recently uploaded

Christine's Product Research Presentation.pptx
Christine's Product Research Presentation.pptxChristine's Product Research Presentation.pptx
Christine's Product Research Presentation.pptx
christinelarrosa
 
Y-Combinator seed pitch deck template PP
Y-Combinator seed pitch deck template PPY-Combinator seed pitch deck template PP
Y-Combinator seed pitch deck template PP
c5vrf27qcz
 
LF Energy Webinar: Carbon Data Specifications: Mechanisms to Improve Data Acc...
LF Energy Webinar: Carbon Data Specifications: Mechanisms to Improve Data Acc...LF Energy Webinar: Carbon Data Specifications: Mechanisms to Improve Data Acc...
LF Energy Webinar: Carbon Data Specifications: Mechanisms to Improve Data Acc...
DanBrown980551
 
Christine's Supplier Sourcing Presentaion.pptx
Christine's Supplier Sourcing Presentaion.pptxChristine's Supplier Sourcing Presentaion.pptx
Christine's Supplier Sourcing Presentaion.pptx
christinelarrosa
 
GlobalLogic Java Community Webinar #18 “How to Improve Web Application Perfor...
GlobalLogic Java Community Webinar #18 “How to Improve Web Application Perfor...GlobalLogic Java Community Webinar #18 “How to Improve Web Application Perfor...
GlobalLogic Java Community Webinar #18 “How to Improve Web Application Perfor...
GlobalLogic Ukraine
 
Leveraging the Graph for Clinical Trials and Standards
Leveraging the Graph for Clinical Trials and StandardsLeveraging the Graph for Clinical Trials and Standards
Leveraging the Graph for Clinical Trials and Standards
Neo4j
 
Poznań ACE event - 19.06.2024 Team 24 Wrapup slidedeck
Poznań ACE event - 19.06.2024 Team 24 Wrapup slidedeckPoznań ACE event - 19.06.2024 Team 24 Wrapup slidedeck
Poznań ACE event - 19.06.2024 Team 24 Wrapup slidedeck
FilipTomaszewski5
 
AppSec PNW: Android and iOS Application Security with MobSF
AppSec PNW: Android and iOS Application Security with MobSFAppSec PNW: Android and iOS Application Security with MobSF
AppSec PNW: Android and iOS Application Security with MobSF
Ajin Abraham
 
"Frontline Battles with DDoS: Best practices and Lessons Learned", Igor Ivaniuk
"Frontline Battles with DDoS: Best practices and Lessons Learned",  Igor Ivaniuk"Frontline Battles with DDoS: Best practices and Lessons Learned",  Igor Ivaniuk
"Frontline Battles with DDoS: Best practices and Lessons Learned", Igor Ivaniuk
Fwdays
 
Northern Engraving | Modern Metal Trim, Nameplates and Appliance Panels
Northern Engraving | Modern Metal Trim, Nameplates and Appliance PanelsNorthern Engraving | Modern Metal Trim, Nameplates and Appliance Panels
Northern Engraving | Modern Metal Trim, Nameplates and Appliance Panels
Northern Engraving
 
Mutation Testing for Task-Oriented Chatbots
Mutation Testing for Task-Oriented ChatbotsMutation Testing for Task-Oriented Chatbots
Mutation Testing for Task-Oriented Chatbots
Pablo Gómez Abajo
 
inQuba Webinar Mastering Customer Journey Management with Dr Graham Hill
inQuba Webinar Mastering Customer Journey Management with Dr Graham HillinQuba Webinar Mastering Customer Journey Management with Dr Graham Hill
inQuba Webinar Mastering Customer Journey Management with Dr Graham Hill
LizaNolte
 
Demystifying Knowledge Management through Storytelling
Demystifying Knowledge Management through StorytellingDemystifying Knowledge Management through Storytelling
Demystifying Knowledge Management through Storytelling
Enterprise Knowledge
 
ScyllaDB Tablets: Rethinking Replication
ScyllaDB Tablets: Rethinking ReplicationScyllaDB Tablets: Rethinking Replication
ScyllaDB Tablets: Rethinking Replication
ScyllaDB
 
QR Secure: A Hybrid Approach Using Machine Learning and Security Validation F...
QR Secure: A Hybrid Approach Using Machine Learning and Security Validation F...QR Secure: A Hybrid Approach Using Machine Learning and Security Validation F...
QR Secure: A Hybrid Approach Using Machine Learning and Security Validation F...
AlexanderRichford
 
Must Know Postgres Extension for DBA and Developer during Migration
Must Know Postgres Extension for DBA and Developer during MigrationMust Know Postgres Extension for DBA and Developer during Migration
Must Know Postgres Extension for DBA and Developer during Migration
Mydbops
 
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectorsConnector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors
DianaGray10
 
The Microsoft 365 Migration Tutorial For Beginner.pptx
The Microsoft 365 Migration Tutorial For Beginner.pptxThe Microsoft 365 Migration Tutorial For Beginner.pptx
The Microsoft 365 Migration Tutorial For Beginner.pptx
operationspcvita
 
Biomedical Knowledge Graphs for Data Scientists and Bioinformaticians
Biomedical Knowledge Graphs for Data Scientists and BioinformaticiansBiomedical Knowledge Graphs for Data Scientists and Bioinformaticians
Biomedical Knowledge Graphs for Data Scientists and Bioinformaticians
Neo4j
 
Essentials of Automations: Exploring Attributes & Automation Parameters
Essentials of Automations: Exploring Attributes & Automation ParametersEssentials of Automations: Exploring Attributes & Automation Parameters
Essentials of Automations: Exploring Attributes & Automation Parameters
Safe Software
 

Recently uploaded (20)

Christine's Product Research Presentation.pptx
Christine's Product Research Presentation.pptxChristine's Product Research Presentation.pptx
Christine's Product Research Presentation.pptx
 
Y-Combinator seed pitch deck template PP
Y-Combinator seed pitch deck template PPY-Combinator seed pitch deck template PP
Y-Combinator seed pitch deck template PP
 
LF Energy Webinar: Carbon Data Specifications: Mechanisms to Improve Data Acc...
LF Energy Webinar: Carbon Data Specifications: Mechanisms to Improve Data Acc...LF Energy Webinar: Carbon Data Specifications: Mechanisms to Improve Data Acc...
LF Energy Webinar: Carbon Data Specifications: Mechanisms to Improve Data Acc...
 
Christine's Supplier Sourcing Presentaion.pptx
Christine's Supplier Sourcing Presentaion.pptxChristine's Supplier Sourcing Presentaion.pptx
Christine's Supplier Sourcing Presentaion.pptx
 
GlobalLogic Java Community Webinar #18 “How to Improve Web Application Perfor...
GlobalLogic Java Community Webinar #18 “How to Improve Web Application Perfor...GlobalLogic Java Community Webinar #18 “How to Improve Web Application Perfor...
GlobalLogic Java Community Webinar #18 “How to Improve Web Application Perfor...
 
Leveraging the Graph for Clinical Trials and Standards
Leveraging the Graph for Clinical Trials and StandardsLeveraging the Graph for Clinical Trials and Standards
Leveraging the Graph for Clinical Trials and Standards
 
Poznań ACE event - 19.06.2024 Team 24 Wrapup slidedeck
Poznań ACE event - 19.06.2024 Team 24 Wrapup slidedeckPoznań ACE event - 19.06.2024 Team 24 Wrapup slidedeck
Poznań ACE event - 19.06.2024 Team 24 Wrapup slidedeck
 
AppSec PNW: Android and iOS Application Security with MobSF
AppSec PNW: Android and iOS Application Security with MobSFAppSec PNW: Android and iOS Application Security with MobSF
AppSec PNW: Android and iOS Application Security with MobSF
 
"Frontline Battles with DDoS: Best practices and Lessons Learned", Igor Ivaniuk
"Frontline Battles with DDoS: Best practices and Lessons Learned",  Igor Ivaniuk"Frontline Battles with DDoS: Best practices and Lessons Learned",  Igor Ivaniuk
"Frontline Battles with DDoS: Best practices and Lessons Learned", Igor Ivaniuk
 
Northern Engraving | Modern Metal Trim, Nameplates and Appliance Panels
Northern Engraving | Modern Metal Trim, Nameplates and Appliance PanelsNorthern Engraving | Modern Metal Trim, Nameplates and Appliance Panels
Northern Engraving | Modern Metal Trim, Nameplates and Appliance Panels
 
Mutation Testing for Task-Oriented Chatbots
Mutation Testing for Task-Oriented ChatbotsMutation Testing for Task-Oriented Chatbots
Mutation Testing for Task-Oriented Chatbots
 
inQuba Webinar Mastering Customer Journey Management with Dr Graham Hill
inQuba Webinar Mastering Customer Journey Management with Dr Graham HillinQuba Webinar Mastering Customer Journey Management with Dr Graham Hill
inQuba Webinar Mastering Customer Journey Management with Dr Graham Hill
 
Demystifying Knowledge Management through Storytelling
Demystifying Knowledge Management through StorytellingDemystifying Knowledge Management through Storytelling
Demystifying Knowledge Management through Storytelling
 
ScyllaDB Tablets: Rethinking Replication
ScyllaDB Tablets: Rethinking ReplicationScyllaDB Tablets: Rethinking Replication
ScyllaDB Tablets: Rethinking Replication
 
QR Secure: A Hybrid Approach Using Machine Learning and Security Validation F...
QR Secure: A Hybrid Approach Using Machine Learning and Security Validation F...QR Secure: A Hybrid Approach Using Machine Learning and Security Validation F...
QR Secure: A Hybrid Approach Using Machine Learning and Security Validation F...
 
Must Know Postgres Extension for DBA and Developer during Migration
Must Know Postgres Extension for DBA and Developer during MigrationMust Know Postgres Extension for DBA and Developer during Migration
Must Know Postgres Extension for DBA and Developer during Migration
 
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectorsConnector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors
 
The Microsoft 365 Migration Tutorial For Beginner.pptx
The Microsoft 365 Migration Tutorial For Beginner.pptxThe Microsoft 365 Migration Tutorial For Beginner.pptx
The Microsoft 365 Migration Tutorial For Beginner.pptx
 
Biomedical Knowledge Graphs for Data Scientists and Bioinformaticians
Biomedical Knowledge Graphs for Data Scientists and BioinformaticiansBiomedical Knowledge Graphs for Data Scientists and Bioinformaticians
Biomedical Knowledge Graphs for Data Scientists and Bioinformaticians
 
Essentials of Automations: Exploring Attributes & Automation Parameters
Essentials of Automations: Exploring Attributes & Automation ParametersEssentials of Automations: Exploring Attributes & Automation Parameters
Essentials of Automations: Exploring Attributes & Automation Parameters
 

Testing Strategies for Data Lake Hosted on Hadoop
5
Structured Data (1/9)
Testing Gates
▪ Data lake creation and testing can take place around the following areas:
• Schema validations
• Data masking validations
• Data reconciliation at each load frequency
• ELT framework (Extract, Load and Transform)
• On-premise vs. on-cloud validations (when the data lake is hosted on cloud)
• Data quality and standardization validations
• Data partitioning and compaction
6
Structured Data (2/9)
Schema Validation
▪ Data from heterogeneous sources lands in the data lake, and the table schema of every table is preserved from the source
▪ As part of schema validation, the QA team can cover the following:
• Data type
• Data length
• Null/Not Null constraints
• Delimiters (pay attention to delimiters arriving as part of the data)
• Special characters, visible and invisible. For e.g., hex code 'c2 ad' is a soft hyphen that appears as white space and does not go away when the TRIM function is applied during data comparison
▪ Source metadata and data lake metadata can be extracted from their respective metastores and compared
• If the source is on SQL Server (SSMS), metadata can be retrieved using the sp_help stored procedure
• If the data lake is on HDFS, Hive keeps its own metadata tables such as DBS, TBLS, COLUMNS_V2 and SDS
▪ See https://utf8-chartable.de/unicode-utf8-table.pl for more characters
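The soft-hyphen pitfall above can be checked programmatically. This is a minimal sketch (the character list and function name are illustrative, not from the deck) that flags invisible characters which look like spaces yet survive a TRIM-style cleanup:

```python
# Hypothetical helper: flag invisible characters (e.g. the soft hyphen,
# UTF-8 bytes 'c2 ad') that render as white space but are NOT stripped
# by TRIM/strip() during column-level comparison. List is illustrative.
INVISIBLE_CHARS = {
    "\u00ad": "SOFT HYPHEN",
    "\u200b": "ZERO WIDTH SPACE",
    "\u00a0": "NO-BREAK SPACE",
    "\ufeff": "BYTE ORDER MARK",
}

def find_invisible_chars(value: str):
    """Return (position, name) pairs for every hidden character in value."""
    return [(i, INVISIBLE_CHARS[ch]) for i, ch in enumerate(value)
            if ch in INVISIBLE_CHARS]

# 'Smith' followed by a soft hyphen: strip() leaves it untouched
dirty = "Smith\u00ad"
assert dirty.strip() == dirty            # TRIM-style cleanup does not help
print(find_invisible_chars(dirty))       # [(5, 'SOFT HYPHEN')]
```

Running such a scan on both source and lake extracts before comparison avoids false mismatches caused by characters the eye cannot see.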
7
Structured Data (3/9)
Data Masking Validations
▪ Source systems may contain PHI/PII data in unmasked form unless the client has anonymized it beforehand
▪ The implementation of masking logic can be tested against the pre-agreed masking rules
▪ Masking logic can be rewritten as a SQL query or Excel formula and its output compared with the data masked by the ETL code that has flowed into the data lake
• For e.g., unmasked SSN 123-456-7891 needs to be masked as XXX-XX-7891
• For e.g., unmasked email abc.def@xyz.com needs to be masked as axx.xxx@xyz.com
▪ Pay attention to unmasked data arriving in an unexpected format, which can cause the masking logic to fail
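The two masking rules above can be expressed as small reference functions and compared against the ETL output. A sketch, assuming the exact formats shown on the slide (the function names and the ValueError path for malformed input are mine):

```python
import re

# Hypothetical expected-masking helpers mirroring the slide's rules:
# SSN 123-456-7891 -> XXX-XX-7891, email abc.def@xyz.com -> axx.xxx@xyz.com
def mask_ssn(ssn: str) -> str:
    m = re.fullmatch(r"(\d{3})-(\d{3})-(\d{4})", ssn)
    if not m:
        # unexpected format: surface it, as the slide warns masking may fail
        raise ValueError(f"unexpected SSN format: {ssn!r}")
    return f"XXX-XX-{m.group(3)}"

def mask_email(email: str) -> str:
    local, _, domain = email.partition("@")
    # keep only the first character of the local part; X out the rest,
    # preserving the dots
    masked = local[0] + "".join("." if c == "." else "x" for c in local[1:])
    return f"{masked}@{domain}"

assert mask_ssn("123-456-7891") == "XXX-XX-7891"
assert mask_email("abc.def@xyz.com") == "axx.xxx@xyz.com"
```

Comparing such reference output row by row with the lake's masked column catches both wrong masking and the slide's "unexpected input format" failure mode.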
8
Structured Data (4/9)
Data Reconciliation at Each Load Frequency
▪ Data reconciliation is a testing gate wherein the data loaded into the target is compared against the data in the source to ensure no data is dropped or corrupted in the migration process
▪ Record count for each table between source and data lake:
• Initial load
• Incremental load
▪ Truncate load vs. append load:
• Truncate load: data in the target table is truncated every time a new data feed is about to be loaded
• Append load: a new data feed is appended to the data already present in the target table
▪ Data validation for each table between source and data lake: data in each row and column of the source table is compared with the data lake
• With MS Excel, batching is required to compare a manageable subset of data
• To compare the entire dataset, a custom-built automation tool can be used
▪ Duplicates in the data lake: SQL queries can be used to identify whether any duplicates were introduced during the data lake load
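The count reconciliation step above reduces to running `SELECT COUNT(*)` on both sides and diffing the results. A minimal sketch, with sqlite3 standing in for the real SQL Server and Hive/Impala connections (table name and a deliberately dropped row are illustrative):

```python
import sqlite3

# Minimal reconciliation sketch: compare per-table row counts between a
# "source" and a "lake" connection. sqlite3 is a stand-in for the real
# source and data lake engines; the table name is illustrative.
def row_counts(conn, tables):
    return {t: conn.execute(f"SELECT COUNT(*) FROM {t}").fetchone()[0]
            for t in tables}

def reconcile(source, lake, tables):
    """Return the tables whose counts disagree, with both counts."""
    src, tgt = row_counts(source, tables), row_counts(lake, tables)
    return {t: (src[t], tgt[t]) for t in tables if src[t] != tgt[t]}

# demo: the lake copy silently dropped one row
source, lake = sqlite3.connect(":memory:"), sqlite3.connect(":memory:")
for conn in (source, lake):
    conn.execute("CREATE TABLE patient (id INTEGER)")
source.executemany("INSERT INTO patient VALUES (?)", [(1,), (2,), (3,)])
lake.executemany("INSERT INTO patient VALUES (?)", [(1,), (2,)])
print(reconcile(source, lake, ["patient"]))   # {'patient': (3, 2)}
```

The same loop runs after every initial and incremental load; an empty result means the counts reconcile.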
9
Structured Data (5/9)
ELT Framework Validation
▪ Logging the source type indicates whether the source data comes from a table or a flat file
▪ Logging the source connection string (DB link, file path, etc.):
• Indicates the database connection string when source data comes from a database table
• When the source is a flat file, indicates the landing location of the file and where to read the data from
▪ Generation of batch IDs on a fresh run, and on rerun after a failure, helps identify the data loaded in each day's batch
▪ Flags such as the primary key flag, truncate load flag, critical table flag and upload-to-cloud flag define the behavior of the ELT jobs
▪ Logging of records processed, loaded and rejected for each table shows the number of records extracted from source and rejected or loaded into the target data lake
▪ Polling frequency, trigger check and email notification settings indicate how often to poll for incoming files/data, when to trigger the next batch, and when to send notifications of batch status
10
Structured Data (6/9)
On-Premise vs. On-Cloud Validation
▪ Additional testing is required when sources are hosted on-premise and the data lake is created on cloud
▪ Validate the data types supported by the cloud application (through which data analysts/scientists will query the data)
• For e.g., Azure SQL Data Warehouse does not support the timestamp and text data types; source data must be cast to datetime2 and varchar respectively
• For e.g., Impala does not support the DOUBLE data type, which has to be converted to NUMERIC
▪ Validate user group access (who can see what type of data)
▪ Validate masked/unmasked views based on the type of user
▪ Validate attribute names containing spaces, special characters or reserved keywords between on-premise and on-cloud
• For e.g., source attributes named LOCATION and DIV are reserved keywords in Impala, so they must be renamed (e.g., to LOC and DIVISION) to preserve their meaning
▪ Validate external tables created on the HDFS files published through ELT jobs
• For e.g., validate whether the external table points to the correct HDFS location where the ETL jobs publish the files
▪ Validate data consistency between on-premise and on-cloud
• For e.g., use custom-built validation tools to compare each attribute
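One way to make the data type validation above repeatable is to encode the engine's unsupported types as a mapping and scan the source schema for columns that need a cast. This is only a sketch: the mapping simply restates the slide's examples and the schema dictionary is illustrative.

```python
# Illustrative type-mapping check: source types the cloud engine does not
# support must be cast before load. The mapping restates the examples on
# this slide and is an assumption, not an authoritative support matrix.
CAST_RULES = {
    "timestamp": "datetime2",   # Azure SQL DW example from the slide
    "text": "varchar",          # Azure SQL DW example from the slide
    "double": "numeric",        # Impala example from the slide
}

def required_casts(source_schema: dict) -> dict:
    """Map column -> (source_type, target_type) for columns needing a cast."""
    return {col: (t, CAST_RULES[t.lower()])
            for col, t in source_schema.items() if t.lower() in CAST_RULES}

schema = {"admit_ts": "timestamp", "notes": "text", "mrn": "varchar"}
print(required_casts(schema))
# {'admit_ts': ('timestamp', 'datetime2'), 'notes': ('text', 'varchar')}
```

QA can diff this expected cast list against the DDL actually deployed on cloud.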
11
Structured Data (7/9)
Data Quality and Standardization Validation
▪ Data in the data lake needs to be cleansed and standardized for better analysis
▪ Actual data is not removed or updated; instead, flags (Warning/Invalid) highlight the data quality
▪ Validate the various DQ/DS rules by running SQL queries on source data and comparing the output with that of the DQ/DS tools
▪ For e.g., Null Check DQ rule on MRN (Medical Record Number): when data is processed through a DQ tool, all records with a NULL MRN are flagged as WARNING/INVALID, with a remark column stating MRN is NULL
▪ For e.g., Missing Parent DQ rule on Encounter ID with respect to Patient: when an encounter has no associated patient in the patient table, the Encounter ID is flagged as WARNING/INVALID, with a remark column stating PATIENT MISSING
▪ For e.g., Race data standardization:
• Race codes in the source such as 1001 and 1002 need to be standardized to their corresponding descriptions, such as Hispanic, Asian, etc.
• Based on requirements, standardization can be applied to reference tables or to transaction (data) tables as well
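The two DQ rules above follow the same flag-don't-delete pattern, which can be sketched as follows (column names, flag values and remarks mirror the slide's examples; the helper names are mine):

```python
# Sketch of the two DQ rules described above: rows are flagged with a
# remark rather than removed. Column/table names are illustrative.
def null_check(rows, key="mrn"):
    """Flag rows whose MRN is missing as WARNING, with a remark."""
    return [{**r, "dq_flag": "WARNING", "dq_remark": "MRN is NULL"}
            if r.get(key) is None
            else {**r, "dq_flag": "VALID", "dq_remark": ""}
            for r in rows]

def missing_parent_check(encounters, patient_ids):
    """Flag encounters whose patient is absent from the patient table."""
    return [{**e, "dq_flag": "INVALID", "dq_remark": "PATIENT MISSING"}
            if e["patient_id"] not in patient_ids
            else {**e, "dq_flag": "VALID", "dq_remark": ""}
            for e in encounters]

rows = [{"mrn": "M1"}, {"mrn": None}]
print([r["dq_flag"] for r in null_check(rows)])        # ['VALID', 'WARNING']

encs = [{"encounter_id": "E1", "patient_id": "P1"},
        {"encounter_id": "E2", "patient_id": "P9"}]
print([e["dq_flag"] for e in missing_parent_check(encs, {"P1"})])
# ['VALID', 'INVALID']
```

The QA equivalent is to run the same predicates as SQL against the source and diff the flags with what the DQ tool wrote.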
12
Structured Data (8/9)
Data Partitioning and Compaction
▪ When the data lake is created on cloud using the Hadoop file system (HDFS), it is preferable to store data in partitions (based on user-provided partition criteria) and use compaction to minimize read operations against the underlying HDFS
▪ Validate whether the source data is partitioned appropriately while being stored in the data lake on HDFS. For e.g., partitioning on Encounter_Create_Date:
• This creates a year-wise folder structure in the output, with each file holding the encounters for that year
• Data retrieval is faster when analysts/scientists query a specific date range, since the data is already stored in such partitions
▪ Compaction covers two areas:
• Merging multiple smaller file chunks into a predefined file size to avoid multiple read operations
• Converting text format into Parquet/ZIP format to reduce file size
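The year-wise folder layout described above can be sketched in plain Python: group rows by the year of Encounter_Create_Date, producing one "partition folder" per year, just as a partitioned HDFS write would. The partition naming and field names here are assumptions for illustration.

```python
from collections import defaultdict
from datetime import date

# Illustrative sketch of date-based partitioning: group encounters by the
# year of Encounter_Create_Date, mirroring the year-wise folder layout a
# partitioned HDFS write would produce. Folder names are assumed.
def partition_by_year(encounters):
    parts = defaultdict(list)
    for e in encounters:
        parts[f"encounter_create_year={e['create_date'].year}"].append(e)
    return dict(parts)

encounters = [
    {"id": "E1", "create_date": date(2018, 3, 1)},
    {"id": "E2", "create_date": date(2019, 7, 9)},
    {"id": "E3", "create_date": date(2019, 1, 2)},
]
layout = partition_by_year(encounters)
print(sorted(layout))                              # one "folder" per year
print(len(layout["encounter_create_year=2019"]))   # 2
```

QA validation then amounts to checking that each partition folder on HDFS contains only rows whose date falls in that partition, and that no rows are missing across partitions.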
13
Structured Data (9/9)
Data Partitioning and Compaction
▪ Validate the size of the files published on HDFS by the ELT jobs by logging into Impala
• For e.g., ELT jobs produced .txt files on cloud with a size of 3.19 GB, which reduced to 516 MB after Parquet conversion
▪ Validate the merging of multiple smaller files into one or more large files
• For e.g., a DQ-DS tool might produce many small files (based on storage availability on the underlying data nodes) at the output location; a utility can be written to merge all of them into a single Parquet-format file
14
Structured Data – Challenges
Test Strategy
▪ ELT jobs are unable to manage continuously varying data containing special characters
▪ Delimiters arriving as part of incoming data
▪ PHI data in a format different from the test data may cause the masking logic to fail
▪ Backdated inserts are not captured in incremental runs
Tools and Technology
▪ Limitations on the data types, reserved keywords and special characters handled by cloud applications
▪ Date format conversion based on the time zone selected during installation
▪ Data retrieval challenges, and the cost involved in downloading data for analysis
Configuration / Environment
▪ Late-arriving flat files
▪ Any service breakdown in production causes parts of the end-to-end workflow to break
Others
▪ Project timelines need to accommodate unknowns discovered in the data during production deployment and/or a few weeks after deployment
15
Semi-structured Data (1/6)
Flow Diagram of Semi-Structured (JSON) Message Ingestion into the Data Lake
16
Semi-structured Data (2/6)
Testing Gates
▪ Data lake creation and testing for semi-structured data can take place in the following areas:
• JSON message validation
• Data reconciliation
• ELT framework (Extract, Load and Transform)
• Data quality and standardization validations
17
Semi-structured Data (3/6)
JSON Message Validation
▪ The data lake can be integrated with the Kafka messaging system, which produces JSON messages in semi-structured format
▪ As part of JSON message validation, the QA team can cover the following:
• Comparing the JSON message with the JSON schema provided as part of the requirements
• Data type check
• Null / Not Null constraint check
▪ For instance:

JSON Schema:
{ "ServiceLevel": { "type": ["string", "null"] },
  "ServiceType": { "type": ["string"] } }

JSON Message:
{ "ServiceLevel": "One",
  "ServiceType": "Skilled Nurse" }
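The schema-vs-message comparison above can be automated with a small validator for the slide's simplified per-field schema (each field lists its allowed JSON types, with "null" permitting a missing value). This is a sketch for exactly that simplified form, not a full JSON Schema implementation:

```python
# Minimal validator for the simplified per-field schema shown on this
# slide: each field declares its allowed JSON types; "null" permits a
# missing/None value. Not a full JSON Schema implementation.
TYPE_MAP = {"string": str, "number": (int, float), "boolean": bool}

def validate(message: dict, schema: dict):
    errors = []
    for field, spec in schema.items():
        allowed = spec["type"] if isinstance(spec["type"], list) else [spec["type"]]
        value = message.get(field)
        if value is None:
            if "null" not in allowed:
                errors.append(f"{field}: null not allowed")
        elif not isinstance(value, tuple(TYPE_MAP[t] for t in allowed if t != "null")):
            errors.append(f"{field}: expected {allowed}")
    return errors

schema = {"ServiceLevel": {"type": ["string", "null"]},
          "ServiceType": {"type": ["string"]}}
# the slide's example message passes; a null ServiceType fails
assert validate({"ServiceLevel": "One", "ServiceType": "Skilled Nurse"}, schema) == []
assert validate({"ServiceLevel": None, "ServiceType": None}, schema) == \
    ["ServiceType: null not allowed"]
```

In practice the same checks are usually delegated to a schema validation library, but the logic per message is as above.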
18
Semi-structured Data (4/6)
Data Reconciliation
▪ Data reconciliation is a testing gate wherein the data loaded into the target tables is compared against the data in the source JSON messages, to ensure no data is trimmed, corrupted or missed in the migration process. The following strategies can be used:
▪ Record count for each table between source and data lake:
• Simple JSON: a single JSON message ingested into the lake is loaded as a single row in the target table
• Complex JSON: more than one row is loaded into the target table, depending on the level of hierarchy and nesting present in the JSON message
▪ Data validation for each table between source and data lake: data in each row and column of the JSON message is compared with the data lake
• Use the openjson function in SQL Server to parse the JSON messages and convert them into structured format
• Compare the parsed output of openjson with the data loaded in the target tables using Python
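The Python-side comparison can be sketched as follows: flatten a nested (complex) JSON message the way it fans out into multiple target rows (one row per nested element, parent keys carried down), then diff the result against what was actually loaded. Field and table names here are illustrative, not from the deck.

```python
# Sketch: flatten a complex JSON message into the multiple rows it should
# produce in the target table, then diff against the loaded rows.
# Field names (patient_id, encounters, ...) are illustrative.
def flatten_encounters(message: dict):
    """One output row per nested encounter, with parent keys carried down."""
    return [{"patient_id": message["patient_id"],
             "encounter_id": enc["id"],
             "type": enc["type"]}
            for enc in message["encounters"]]

msg = {"patient_id": "P1",
       "encounters": [{"id": "E1", "type": "inpatient"},
                      {"id": "E2", "type": "er"}]}
expected = flatten_encounters(msg)

# pretend the lake loaded only one of the two expected rows
loaded = [{"patient_id": "P1", "encounter_id": "E1", "type": "inpatient"}]
missing = [row for row in expected if row not in loaded]
print(len(expected), len(missing))   # 2 1
```

A non-empty `missing` list is exactly the "row dropped during migration" condition this testing gate exists to catch; openjson plays the `flatten_encounters` role on the SQL Server side.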
19
Semi-structured Data (5/6)
ELT Framework Validation
▪ Raw layer validation on HDFS: indicates whether the source JSON messages ingested through Kafka are loaded into HDFS in raw form before the processing tool picks them up and loads them into the target tables
▪ Logging of Kafka details: informs about the Kafka topic, partition, offset and hostname
▪ Generation of data lake UIDs helps identify JSON messages
▪ Logging of records processed, loaded and rejected for each JSON ingestion shows the number of records ingested through Kafka, processed, and failed in JSON schema validation, with error logging
▪ Email notifications: show the JSONs ingested through Kafka on an hourly basis, the JSON count loaded into the raw layer, and a daily report with JSON-wise counts, failures and successful ingestions
20
Semi-structured Data (6/6)
Data Quality and Standardization Validation
▪ Data in the data lake needs to be cleansed and standardized for better analysis
▪ In this case, the actual data need not be removed or updated; both valid and erroneous data are logged into the audit tables
▪ Validate JSON messages against the JSON schema
• For e.g., Null Check: if a non-nullable attribute is assigned a null value in the input JSON message, ingestion of that message fails and the error is logged in the audit table with the non-nullable attribute's details
21
Semi-structured Data – Challenges
Test Strategy
▪ ELT jobs are unable to manage continuously varying data containing special characters
▪ Test data preparation as per the test scenarios
▪ Manual validation of a single JSON loaded into multiple tables
▪ Live reconciliation of messages produced through Kafka, due to continuous streaming
Tools and Technology
▪ Limitations on reserved keywords and special character handling
Configuration / Environment
▪ Multiple services running simultaneously on the cluster result in choking of JSON messages in the QA environment
▪ Any service breakdown in production causes parts of the end-to-end workflow to break
Others
▪ Project timelines need to accommodate unknowns discovered in the data during production deployment and/or a few weeks after deployment
22
References
▪ https://www.cloudmoyo.com/blog/difference-between-a-data-warehouse-and-a-data-lake/
▪ https://utf8-chartable.de/unicode-utf8-table.pl
▪ https://www.cigniti.com/blog/5-big-data-testing-challenges/
▪ https://www.confluent.io/blog/the-value-of-apache-kafka-in-big-data-ecosystem/
▪ https://kafka.apache.org/
▪ https://dzone.com/articles/json-drivers-parsing-hierarchical-data
23
About CitiusTech
▪ 3,500+ healthcare IT professionals worldwide
▪ 1,500+ in healthcare software engineering
▪ 800+ HL7 certified professionals
▪ 30%+ CAGR over the last 5 years
▪ 110+ healthcare customers
• Healthcare technology companies
• Hospitals, IDNs & medical groups
• Payers and health plans
• ACO, MCO, HIE, HIX, NHIN and RHIO
• Pharma & Life Sciences companies
Thank You
Authors: Vaibhav Shahane, Vaibhavi Indap
Technical Lead
thoughtleaders@citiustech.com