De-duplicated Refined Zone in Healthcare Data Lake Using Big Data Processing Frameworks

1. This document is confidential and contains proprietary information, including trade secrets of CitiusTech. Neither the document nor any of the information contained in it may be reproduced or disclosed to any unauthorized person under any circumstances without the express written permission of CitiusTech.
De-duplicated Refined Zone in Healthcare Data Lake Using Big Data Processing Frameworks
3rd May, 2018 | Author: Sagar Engineer | Technical Lead
CitiusTech Thought Leadership
2. Objective
 Data preparation is a costly and complex process. Even a small error may lead to inconsistent records and incorrect insights. Rectifying data errors often involves significant time and effort.
 Veracity plays an important role in data quality. Veracity generally describes issues such as inconsistency, incompleteness, duplication and ambiguity of data; one of the most important of these is data duplication.
 Duplicate records can cause:
• Incorrect / unwanted / ambiguous reports and skewed decisions
• Difficulty in creating a 360-degree view of a patient
• Problems in providing prompt issue resolution to customers
• Inefficiency and loss of productivity
• Unnecessary processing power / time when the number of duplicate records is large
 Moreover, the data duplication issue becomes difficult to handle in Big Data because:
• The Hadoop / Big Data ecosystem only supports appending data; record-level updates are not supported
• Updates are only possible by rewriting the entire dataset with merged records
 The objective of this document is to provide an effective approach to creating a de-duplicated zone in a Data Lake using Big Data frameworks.
3. Agenda
 Addressing the Data Duplication Challenge
 High Level Architecture
 Implementing the Solution
 References
4. Addressing the Data Duplication Challenge (1/2)
Approach 1: Keep duplicate records in the Data Lake and query records using the maximum timestamp to get unique records (a query sketch follows below)
 The user needs to provide the maximum timestamp as a predicate in each data retrieval query
 This option can cause performance issues once data grows beyond a few terabytes, depending on the cluster size
 To get better performance, this option needs a powerful cluster, causing an increase in RAM / memory cost
Pros
 Eliminates an additional de-duplication step in batch processing
 Leverages in-memory processing logic for retrieval of the latest records
 Works for datasets up to a few hundred terabytes, depending on the cluster size
Cons
 Not feasible for hundreds of petabytes of data
 High infrastructure cost for RAM / memory to hold hundreds of terabytes of data
 Response time for retrieval queries will be high if table joins are involved
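A minimal PySpark sketch of the Approach 1 retrieval pattern, assuming a hypothetical refined-zone table refined.patient_events keyed by patient_id with an updated_ts change timestamp (all names are illustrative placeholders, not taken from the deck):

```python
# Approach 1 sketch: duplicates stay on disk; every read filters down to the
# record with the maximum timestamp per key. Table and column names
# (refined.patient_events, patient_id, updated_ts) are placeholders.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("approach1-latest-records")
         .enableHiveSupport()
         .getOrCreate())

latest = spark.sql("""
    SELECT e.*
    FROM refined.patient_events e
    JOIN (
        SELECT patient_id, MAX(updated_ts) AS max_ts
        FROM refined.patient_events
        GROUP BY patient_id
    ) m
      ON e.patient_id = m.patient_id
     AND e.updated_ts = m.max_ts      -- the "maximum timestamp" predicate from the slide
""")

latest.createOrReplaceTempView("patient_events_latest")  # downstream queries must start here
```

Every consumer query has to carry (or start from) this max-timestamp filter, which is the per-query overhead Approach 1 trades for skipping a batch de-duplication step.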
5. Addressing the Data Duplication Challenge (2/2)
Approach 2: Match and rewrite records to create a golden copy (Preferred Option)
 Implement complex logic for identifying and rewriting records
 The time taken by the process varies depending on the dataset and cluster size
 Creates a non-ambiguous golden copy of the dataset for further analysis
Pros
 Heavy processing for de-duplication becomes part of batch processing
 Faster query response; scalable for joining tables
 Data is stored on HDFS (Hadoop Distributed File System)
 No RegionServer instances are needed, which makes it cost effective
 Partitioning helps in segregating data (an illustrative table layout follows below)
 Support for file formats like Parquet enables faster query response
 Support for append and overwrite operations on tables and partitions
 Apache Hive is mainly used for heavy batch processing and analytical queries
Cons
 Batch processing may take some time to complete
 One-time coding effort
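To make the storage-related pros concrete, here is a hedged sketch of what a de-duplicated refined-zone table could look like: a Parquet-backed, partitioned Hive table, so that later rewrites touch only the affected partitions. Table, column, and partition names are placeholders, not prescribed by the deck; the Spark session from the earlier sketch is reused.

```python
# Illustrative refined-zone table: Parquet storage for fast scans, plus a
# date-based partition column so de-duplication rewrites can target
# individual partitions instead of the whole dataset.
spark.sql("""
    CREATE TABLE IF NOT EXISTS refined.patient_events_dedup (
        patient_id   STRING,
        encounter_id STRING,
        payload      STRING,
        updated_ts   TIMESTAMP
    )
    PARTITIONED BY (event_date STRING)   -- partitioning segregates data and limits rewrites
    STORED AS PARQUET
""")
```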
6. Approach 2: High Level Architecture (1/2)
[Architecture diagram] Data Sources (Relational Sources, MDM, Unstructured Data) → Hadoop Big Data Lake (Landing Zone → Raw Zone → ETL → Refined Zone, De-duplicated / Golden Record → Data Mart) → Consumers (Ad-hoc Querying, Applications, Data Visualization, Self-Service Tools, Data Analysis)
7. Approach 2: High Level Architecture (2/2)
Landing Zone
 Data from the source is loaded into the Landing zone and then compared with the Raw zone during processing, for example to identify the changed dataset or to perform data quality checks
Raw Zone
 The Raw zone holds the relational data from the Landing zone and may be stored in partitions. All incremental data is appended to the Raw zone. The Raw zone also stores the unstructured / semi-structured data from the respective sources. Users can perform raw analytics on the Raw zone
ETL
 The ETL framework picks up the data from the Raw zone and applies transformations, for example mapping to the target model / reconciliation, parsing unstructured / semi-structured data, extracting specified elements and storing them in tabular format
Refined Zone
 Data from the Raw zone is reconciled / standardized / cleansed and de-duplicated in the Refined zone
 An easy and proven 3-step approach creates the refined, de-duplicated dataset in Hive using Spark / Hive QL
 This is a good use case for Spark jobs / Hive queries, depending on the complexity
 Comparing records based on keys and surviving the records with the latest timestamp can be the most effective way of de-duplication (see the sketch below)
 Hadoop / HDFS is efficient at saving data in append mode; handling data updates in Hadoop is challenging and there is no bulletproof solution for it
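A short DataFrame sketch of the survivorship rule described above (compare on key, keep the record with the latest timestamp), assuming the placeholder tables and columns from the earlier sketches and that the existing and incremental tables share the same schema:

```python
# Key-plus-latest-timestamp survivorship: rank the versions of each key,
# newest first, and keep only rank 1. All names are illustrative placeholders.
from pyspark.sql import functions as F, Window

existing  = spark.table("refined.patient_events_dedup")        # current refined-zone data
increment = spark.table("landing.patient_events_increment")    # newly landed batch

w = Window.partitionBy("patient_id").orderBy(F.col("updated_ts").desc())

deduped = (existing.unionByName(increment)
           .withColumn("rn", F.row_number().over(w))
           .where(F.col("rn") == 1)      # survive only the latest version of each key
           .drop("rn"))
```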
8. Implementing the Solution: Technology Options (1/2)
Option 1: Use Hive as the processing engine
 Hive uses the MapReduce engine for SQL processing
 Leverage the MapReduce jobs spawned by Hive SQL to identify updates and rewrite the updated datasets
 Use a Hive query to find incremental updates and write new files
 Compare incremental data with existing data using a WHERE clause and get a list of all the affected partitions
 Use HQL to find the latest records and rewrite the affected partitions
Option 2: Use HBase as the data store for the de-duplication zone
 HBase handles updates efficiently on a predefined row key, which acts as the primary key of the table
 This approach helps in building the reconciled table without having to explicitly write code for de-duplicating the data
Option 3: Use a Spark-based processing engine
 Use the Spark engine to implement the complex logic for identifying and rewriting records
 Spark APIs are available in Java, Scala, and Python; Spark SQL makes data transformation operations easy
 Use the Hive context in Spark to find incremental updates and write new files
 Compare incremental data with existing data using a WHERE clause and get a list of all the affected partitions (see the sketch below)
 Use Spark to find the latest records and rewrite the affected partitions
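A hedged sketch of the "find affected partitions" step that both the Hive and Spark options describe, expressed with the Spark DataFrame API (an equivalent HQL join would serve the Hive option). All names are the placeholders used in the earlier sketches.

```python
# Affected partitions = distinct partition values in the refined table whose
# keys also appear in the incremental batch (i.e. records about to be updated).
existing  = spark.table("refined.patient_events_dedup")
increment = spark.table("landing.patient_events_increment")

affected = (existing.select("patient_id", "event_date")
            .join(increment.select("patient_id"), on="patient_id", how="inner")
            .select("event_date")
            .distinct())

affected_partitions = [row.event_date for row in affected.collect()]
print(affected_partitions)   # typically a short list of partitions that must be rewritten
```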
9. Implementing the Solution: Technology Options (2/2)
Option 1: Use Hive as the processing engine
Pros
 The MapReduce distributed engine can handle huge volumes of data
 SQL makes it easy to write the logic instead of writing complex MapReduce code
Cons
 MapReduce processing is very slow
Option 2: Use HBase as the data store for the de-duplication zone
Pros
 Records can be retrieved in a fraction of a second if searched by row key
 HBase handles updates efficiently on a predefined row key, which acts as the primary key
 Transactional processing and real-time querying
Cons
 NoSQL makes it difficult to join tables
 High-volume data ingestion can be as slow as 5,000 records/second
 Data is stored in memory on HBase RegionServer instances, which requires more memory and in turn increases cost
 Ad hoc querying performs full table scans, which is not a feasible approach
Option 3: Use a Spark-based processing engine
Pros
 Up to 100x faster than MapReduce
 Relatively simpler to code compared to MapReduce
 Spark SQL, DataFrame, and Dataset APIs are readily available
 Processing happens in memory and supports overflow to disk
Cons
 Infrastructure cost may go up due to higher memory (RAM) requirements for in-memory analytics
10. Recommended Option: Spark-Based Processing Engine
Spark provides a complete processing stack for batch processing, standard SQL-based processing, machine learning, and stream processing. Although the memory requirement increases as the workload grows, infrastructure cost may not rise drastically due to the decline in memory prices.
Solution Overview
 Tables that need de-duplication should be partitioned by appropriate attributes so that the data is evenly distributed
 Depending on the use case, de-duplicated tables may or may not host semi-structured or unstructured data with unique key identifiers
 Identify the attributes that make a record unique in a given table; these attributes are used during the de-duplication process
 The incremental dataset must have a key to identify affected partitions
 Identify new records (records previously not present in the data lake) in the incremental dataset
 Insert new records into a temp table
 Identify affected partitions containing records to be updated
 Apply de-duplication logic to select only the latest data from the incremental data and the refined zone data
 Overwrite only the affected partitions in the de-duplicated zone with the latest data for updated records
 Append the new records from the temp table to the refined de-duplicated zone (an end-to-end sketch of these steps follows below)
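An end-to-end sketch of the steps listed above, under two stated assumptions: Spark's dynamic partition overwrite mode is enabled so only affected partitions are rewritten, and a record's partition column value does not change between versions (so an update always lands in the partition it originally wrote to). Every table, column, and database name is an illustrative placeholder carried over from the earlier sketches.

```python
# End-to-end de-duplication sketch (placeholder names throughout).
from pyspark.sql import SparkSession, functions as F, Window

spark = (SparkSession.builder
         .appName("refined-zone-dedup")
         .enableHiveSupport()
         .getOrCreate())

# Rewrite only the partitions present in the output DataFrame, not the whole table.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

refined   = spark.table("refined.patient_events_dedup")
increment = spark.table("landing.patient_events_increment")

# 1. Split the incremental batch into brand-new keys and updates to existing keys.
existing_keys = refined.select("patient_id").distinct()
new_records = increment.join(existing_keys, "patient_id", "left_anti")
updates     = increment.join(existing_keys, "patient_id", "left_semi")

# 2. Stage the new records in a temp table so they survive the partition rewrite.
new_records.write.mode("overwrite").saveAsTable("staging.patient_events_new")

# 3. Identify affected partitions and rebuild them: union the current contents of
#    those partitions with the updates, keep the latest version per key, overwrite.
affected = (refined.join(updates.select("patient_id"), "patient_id", "left_semi")
            .select("event_date").distinct())
current_in_affected = refined.join(affected, "event_date", "left_semi")

w = Window.partitionBy("patient_id").orderBy(F.col("updated_ts").desc())
rebuilt = (current_in_affected.unionByName(updates)
           .withColumn("rn", F.row_number().over(w))
           .where("rn = 1")
           .drop("rn"))

rebuilt.write.mode("overwrite").insertInto("refined.patient_events_dedup")  # positional; dynamic overwrite

# 4. Append the staged new records to the de-duplicated refined zone.
spark.table("staging.patient_events_new").write.mode("append") \
     .insertInto("refined.patient_events_dedup")
```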
11. References
Data Lake
 http://www.pentaho.com/blog/5-keys-creating-killer-data-lake
 https://www.searchtechnologies.com/blog/search-data-lake-with-big-data
 https://knowledgent.com/whitepaper/design-successful-data-lake/
Hive Transaction Management
 https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions#HiveTransactions-ConfigurationValuestoSetforINSERT,UPDATE,DELETE
12. Keywords
 Data Lake
 Data Lake Strategies
 Refined Zone
 Big Accurate Data
 Golden Record
13. Thank You
Author: Sagar Engineer, Technical Lead
thoughtleaders@citiustech.com
About CitiusTech
 2,900+ Healthcare IT professionals worldwide
 1,200+ Healthcare software engineering
 700+ HL7 certified professionals
 30%+ CAGR over last 5 years
 80+ Healthcare customers
• Healthcare technology companies
• Hospitals, IDNs & medical groups
• Payers and health plans
• ACO, MCO, HIE, HIX, NHIN and RHIO
• Pharma & Life Sciences companies
