This document discusses de-duplicating data in a healthcare data lake using big data processing frameworks. It describes keeping duplicate records and querying the latest one, or rewriting records to create a golden copy. The preferred approach uses Spark to partition data, identify new/updated records, de-duplicate by selecting the latest from incremental and refined data, and overwrite only affected partitions. This creates a non-ambiguous, de-duplicated dataset for analysis in a scalable and cost-effective manner.
De-duplicated Refined Zone in Healthcare Data Lake Using Big Data Processing Frameworks
This document is confidential and contains proprietary information, including trade secrets of CitiusTech. Neither the document nor any of the information contained in it may be reproduced or disclosed to any unauthorized person under any circumstances without the express written permission of CitiusTech.
3rd May, 2018 | Author: Sagar Engineer | Technical Lead
CitiusTech Thought Leadership
Objective
Data preparation is a costly and complex process. Even a small error can lead to inconsistent records and incorrect insights, and rectifying data errors often takes significant time and effort.
Veracity plays an important role in data quality. It covers issues such as inconsistency, incompleteness, duplication, and ambiguity of data; one of the most important of these is data duplication.
Duplicate records can cause:
• Incorrect, unwanted, or ambiguous reports and skewed decisions
• Difficulty in creating a 360-degree view of a patient
• Problems in providing prompt issue resolution to customers
• Inefficiency and loss of productivity
• A large number of duplicate records can consume unnecessary processing power and time
Moreover, data duplication becomes difficult to handle in Big Data because:
• The Hadoop / Big Data ecosystem only supports appending data; record-level updates are not supported
• Updates are only possible by rewriting the entire dataset with merged records
The objective of this document is to provide an effective approach to creating a de-duplicated zone in a Data Lake using Big Data frameworks.
Agenda
Addressing the Data Duplication Challenge
High Level Architecture
Implementing the Solution
References
Addressing the Data Duplication Challenge (1/2)
Approach 1: Keep duplicate records in the Data Lake and query with the maximum timestamp to get unique records
• The user must supply a maximum-timestamp predicate in each data retrieval query
• This option can cause performance issues once data grows beyond a few terabytes, depending on cluster size
• Better performance requires a powerful cluster, increasing RAM / memory cost
Pros
• Eliminates an additional batch-processing step for de-duplication
• Leverages in-memory processing to retrieve the latest records
• Works for datasets up to a few hundred terabytes, depending on cluster size
Cons
• Not feasible for hundreds of petabytes of data
• High infrastructure cost for RAM / memory to fit hundreds of terabytes of data
• Retrieval queries respond slowly when table joins are involved
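The query-time de-duplication in Approach 1 amounts to keeping, for each business key, only the row with the maximum timestamp. A minimal plain-Python sketch of that logic (the record layout and field order are hypothetical; a real deployment would express this as a Hive / Spark SQL window function or GROUP BY with MAX(timestamp)):

```python
# Each record: (patient_id, updated_at, payload) -- hypothetical layout for illustration.
def latest_per_key(records):
    """Keep only the record with the maximum timestamp for each key,
    i.e. what a max-timestamp predicate does at query time."""
    latest = {}
    for key, ts, payload in records:
        if key not in latest or ts > latest[key][0]:
            latest[key] = (ts, payload)
    return {k: v[1] for k, v in latest.items()}

rows = [
    ("P001", 1, {"status": "admitted"}),
    ("P001", 3, {"status": "discharged"}),  # latest version of P001
    ("P002", 2, {"status": "admitted"}),
]
deduped = latest_per_key(rows)
```

Because this scan runs on every retrieval query, the cost is paid repeatedly, which is why the approach degrades as data volume grows.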
Addressing the Data Duplication Challenge (2/2)
Approach 2: Match and rewrite records to create a golden copy (preferred option)
• Implement complex logic to identify and rewrite records
• Processing time varies with the dataset and cluster size
• Creates a non-ambiguous golden copy of the dataset for further analysis
Pros
• Heavy de-duplication processing happens as part of batch processing
• Faster query response, and scalable when joining tables
• Data is stored on HDFS (Hadoop Distributed File System)
• No RegionServer instances, which makes it cost-effective
• Partitioning helps segregate data
• Support for file formats like Parquet enables faster query response
• Tables and partitions support both append and overwrite
• Apache Hive is well suited to heavy batch processing and analytical queries
Cons
• Batch processing may take some time to complete
• One-time coding effort
Approach 2: High Level Architecture (1/2)
[Architecture diagram] Data sources (relational sources, MDM, unstructured data) feed the Hadoop Big Data Lake via ETL. Within the lake, data flows from the Landing Zone through the Raw Zone into the Refined Zone, where the de-duplicated Golden Record is produced. From there it populates a Data Mart and supports data analysis through ad-hoc querying applications, data visualization, and self-service tools.
Approach 2: High Level Architecture (2/2)
Component Description
• Landing Zone: Data from the source is loaded into the Landing Zone and then compared with the Raw Zone during processing, for example to identify the changed dataset or to perform data quality checks.
• Raw Zone: Holds the relational data from the Landing Zone, possibly stored in partitions. All incremental data is appended to the Raw Zone, which also stores unstructured / semi-structured data from the respective sources. Users can perform raw analytics on this zone.
• ETL: The ETL framework picks up data from the Raw Zone and applies transformations, for example mapping to the target model / reconciliation, parsing unstructured / semi-structured data, and extracting specified elements into tabular format.
• Refined Zone: Data from the Raw Zone is reconciled / standardized / cleansed and de-duplicated in the Refined Zone.

• An easy and proven 3-step approach creates the refined, de-duplicated dataset in Hive using Spark / Hive QL.
• This is a strong use case for Spark jobs or Hive queries, depending on the complexity.
• Comparing records on keys and surviving the record with the latest timestamp can be the most effective way to de-duplicate.
• Hadoop / HDFS is efficient at saving data in append mode; handling data updates in Hadoop is challenging, and there is no bulletproof solution for it.
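The Landing-vs-Raw comparison described above can be sketched as a content-hash check: a landing record whose key is absent from the Raw zone is new, and one whose key exists but whose content hash differs has changed. A minimal sketch, assuming dict-shaped records with an `id` key (field names are hypothetical):

```python
import hashlib
import json

def classify(landing, raw):
    """Split landing-zone records into new vs changed by comparing
    content hashes against the raw zone (illustrative sketch)."""
    def digest(rec):
        # Stable hash of the record's content, independent of key order.
        return hashlib.sha256(json.dumps(rec, sort_keys=True).encode()).hexdigest()

    raw_hashes = {r["id"]: digest(r) for r in raw}
    new, changed = [], []
    for rec in landing:
        h = raw_hashes.get(rec["id"])
        if h is None:
            new.append(rec)          # key never seen before
        elif h != digest(rec):
            changed.append(rec)      # key exists but content differs
    return new, changed

landing = [{"id": "P1", "dx": "A10"}, {"id": "P2", "dx": "B20"}]
raw = [{"id": "P1", "dx": "A09"}]
new, changed = classify(landing, raw)
```

In Spark the same comparison would typically be a join on the key plus a filter on a hash column, but the classification logic is the same.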
Implementing the Solution: Technology Options (1/2)
Option 1: Use Hive as the processing engine
• Hive uses the MapReduce engine for SQL processing; leverage the MapReduce jobs spawned by Hive SQL to identify updates and rewrite updated datasets.
• Use a Hive query to find incremental updates and write new files.
• Compare incremental data with existing data using a WHERE clause to get the list of all affected partitions.
• Use HQL to find the latest records and rewrite the affected partitions.

Option 2: Use HBase as the data store for the de-duplication zone
• HBase handles updates efficiently on a predefined row key, which acts as the table's primary key.
• This approach builds the reconciled table without having to explicitly write de-duplication code.

Option 3: Use a Spark-based processing engine
• Use the Spark engine to implement complex logic for identifying and rewriting records.
• Spark APIs are available in Java, Scala, and Python; Spark SQL also makes data transformations easy.
• Use the Hive context in Spark to find incremental updates and write new files.
• Compare incremental data with existing data using a WHERE clause to get the list of all affected partitions.
• Use Spark to find the latest records and rewrite the affected partitions.
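The "compare with a WHERE clause to get affected partitions" step, common to the Hive and Spark options, amounts to collecting the distinct partition keys of the incremental data that already exist in the refined zone. A plain-Python sketch (the partition column name `admit_month` is hypothetical):

```python
def affected_partitions(incremental, existing_partitions):
    """Return the partitions that must be rewritten: distinct partition
    keys in the incremental set that already exist in the refined zone."""
    incoming = {rec["admit_month"] for rec in incremental}
    return sorted(incoming & set(existing_partitions))

inc = [{"admit_month": "2018-01"}, {"admit_month": "2018-03"}]
parts = affected_partitions(inc, ["2018-01", "2018-02"])
```

In Spark this would be a `select distinct` on the partition column followed by a semi-join against the refined table's partition list; only the partitions returned here need to be rewritten.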
Implementing the Solution: Technology Options (2/2)
Option 1: Hive as the processing engine
Pros
• The MapReduce distributed engine can handle huge volumes of data
• SQL makes it easy to write the logic instead of writing complex MapReduce code
Cons
• MapReduce processing is very slow

Option 2: HBase as the data store for the de-duplication zone
Pros
• Records can be retrieved in a fraction of a second when searched by row key
• HBase handles updates efficiently on a predefined row key, which acts as the primary key
• Supports transactional processing and real-time querying
Cons
• NoSQL makes it difficult to join tables
• High-volume data ingestion can be as slow as 5,000 records/second
• Data is stored in memory on HBase RegionServer instances, which requires more memory and in turn increases cost
• Ad-hoc queries perform full table scans, which is not a feasible approach

Option 3: Spark-based processing engine
Pros
• Up to 100x faster than MapReduce
• Relatively simpler to code than MapReduce
• Spark SQL, DataFrames, and Datasets APIs are readily available
• Processing happens in memory and supports overflow to disk
Cons
• Infrastructure cost may go up due to higher memory (RAM) requirements for in-memory analytics
Recommended Option: Spark-Based Processing Engine
Spark provides a complete processing stack for batch processing, standard SQL-based processing, machine learning, and stream processing. Memory requirements grow with the workload, but infrastructure cost may not rise drastically given the decline in memory prices.
Solution Overview
• Tables requiring de-duplication should be partitioned by appropriate attributes so that data is evenly distributed
• Depending on the use case, deduped tables may or may not host semi-structured or unstructured data with unique key identifiers
• Identify the attributes that make a record unique in a given table; these are used during the de-duplication process
• The incremental dataset must carry a key that identifies affected partitions
• Identify new records (records previously not present in the data lake) in the incremental dataset and insert them into a temp table
• Identify the affected partitions containing records to be updated
• Apply de-duplication logic to select only the latest data from the incremental data and the refined-zone data
• Overwrite only the affected partitions in the de-duplicated zone with the latest data for updated records
• Append the new records from the temp table to the refined de-duplicated zone
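The steps above can be sketched end to end in plain Python. This is a minimal illustration of the merge logic only: in a real Spark job each list/set operation would be a DataFrame transformation, and the final writes would use partition-overwrite and append modes. The record fields (`id`, `part`, `ts`, `v`) are hypothetical:

```python
def merge_incremental(refined, incremental):
    """Solution-overview steps in miniature:
    1. split incremental data into new vs updated records,
    2. find the affected partitions,
    3. de-duplicate by keeping the latest timestamp per key,
    4. overwrite affected partitions and append new records."""
    existing_keys = {r["id"] for r in refined}
    new = [r for r in incremental if r["id"] not in existing_keys]   # "temp table"
    updated = [r for r in incremental if r["id"] in existing_keys]
    affected = {r["part"] for r in updated}                          # partitions to rewrite

    # De-duplicate the affected partitions: latest record per key survives.
    pool = [r for r in refined if r["part"] in affected] + updated
    latest = {}
    for r in pool:
        if r["id"] not in latest or r["ts"] > latest[r["id"]]["ts"]:
            latest[r["id"]] = r

    # Untouched partitions are kept as-is; new records are appended.
    untouched = [r for r in refined if r["part"] not in affected]
    return untouched + list(latest.values()) + new

refined = [
    {"id": "P1", "part": "2018-01", "ts": 1, "v": "a"},
    {"id": "P2", "part": "2018-02", "ts": 1, "v": "b"},
]
incremental = [
    {"id": "P1", "part": "2018-01", "ts": 2, "v": "a2"},  # update to P1
    {"id": "P3", "part": "2018-03", "ts": 1, "v": "c"},   # brand-new record
]
result = merge_incremental(refined, incremental)
```

Because only the affected partitions are rewritten, the batch cost scales with the size of the incremental change rather than with the whole refined dataset.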
Thank You
Author:
Sagar Engineer
Technical Lead
thoughtleaders@citiustech.com
About CitiusTech
• 2,900+ Healthcare IT professionals worldwide
• 1,200+ Healthcare software engineering
• 700+ HL7 certified professionals
• 30%+ CAGR over last 5 years
• 80+ Healthcare customers:
  • Healthcare technology companies
  • Hospitals, IDNs & medical groups
  • Payers and health plans
  • ACO, MCO, HIE, HIX, NHIN and RHIO
  • Pharma & Life Sciences companies