Near Duplicate Detection for Medical Imaging Data Warehouse Construction


MediCurator poster presented at AMIA Joint Summits 2016. Discusses one of my recent works at Emory BMI.

Published in: Healthcare

Pradeeban Kathiravelu, Ashish Sharma
{pkathi2, ashish.sharma} @ Department of Biomedical Informatics, Emory University School of Medicine, Atlanta, GA.

Introduction
● Medical data warehouses and image archives are constructed by integrating multiple private and public data sources.
● Finding almost identical entries is crucial for warehouse construction.
● Medical image archives are huge and consist of structured and hierarchical data, which may be accessed by querying the metadata.
● Existing solutions tend to be too specific.
  – Master Patient Index (MPI) for patient records.
● Multiple dimensions and attributes, including medications, clinical, and pathological data, should be considered for complete duplicate detection and elimination.
● MediCurator is a near duplicate detection framework for heterogeneous medical data sources in constructing data warehouses.

MediCurator Approach
● MediCurator has been developed to retrieve medical data from
  – various data sources, including MySQL, MongoDB, and CSV files, and
  – medical image archives such as TCIA.
● MediCurator fits as part of the ETL process.
  – Duplicates are detected in-memory.
  – Merged data is stored in data warehouses hosted in the Hadoop Distributed File System (HDFS).

Distributed Near Duplicate Detection
● Integrate medical data from various heterogeneous medical data sources and private archives using their public APIs.
● Curate the integrated data into a data warehouse for public access.
● Store the detected duplicate pairs in a separate data source.
● Detect duplicates by analyzing the potential data pairs from the original data sources, using similarity matrices for textual data.
● Use the hierarchical metadata attached to the binary medical data to identify, classify, and find duplicates among the binary raw data.
● Consider the inconsistencies in representation.
  – Usage of acronyms instead of the full form of the attributes.
  – Usage of different measurement units.
● Hazelcast for distributed near duplicate detection.
● Metadata attached to the binary images in medical image archives.
  – The Cancer Imaging Archive (TCIA)

Design
● Data is published to various data sources by the medical data publishers
  – through the respective write APIs of the data sources.
● MediCurator connects to the original data sources through their read APIs.
● The output of consolidated data and duplicate pairs
  – is stored through the relevant write APIs.
● Medical data consumers consume the data from the warehouse composed by MediCurator through its read API.
● The data warehouse is considered to be free from duplicates,
  – subject to false positives and false negatives,
  – based on the effectiveness of the similarity matrices and similarity join algorithms used.

Implementation
● A prototype has been implemented.
  – Hazelcast as the distributed execution framework.
  – Distributed execution of research near duplicate detection algorithms on metadata.
  – Tenfold speed-up compared to existing solutions such as MPI systems.
● MediCurator functions as an integration middleware
  – for data warehouse construction
  – with duplicate detection and elimination
  – from the raw textual medical data, or from the binary data by leveraging the metadata attached to it.

References
● Xiao, C., Wang, W., Lin, X., Yu, J. X., & Wang, G. (2011). Efficient similarity joins for near-duplicate detection. ACM Transactions on Database Systems (TODS), 36(3), 15.
● Kathiravelu, P., Galhardas, H., & Veiga, L. (2015). ∂u∂u Multi-Tenanted Framework: Distributed Near Duplicate Detection for Big Data. On the Move to Meaningful Internet Systems: OTM 2015 Conferences, 237-256. Springer International Publishing.
● Kathiravelu, P., & Sharma, A. (2015). MEDIator: A Data Sharing Synchronization Platform for Heterogeneous Medical Image Archives. Workshop on Connected Health at Big Data Era (BigCHat'15), co-located with the 21st ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2015). ACM.
● Clark, K., Vendt, B., Smith, K., Freymann, J., Kirby, J., Koppel, P., Moore, S., Phillips, S., Maffitt, D., Pringle, M., Tarbox, L., & Prior, F. (2013). The Cancer Imaging Archive (TCIA): Maintaining and Operating a Public Information Repository. Journal of Digital Imaging, 26(6), 1045-1057.

Acknowledgments
● Google Summer of Code 2015
● NCI U01 [1U01CA187013-01], Resources for development and validation of Radiomic Analyses & Adaptive Therapy, Fred Prior, Ashish Sharma (UAMS, Emory)
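
Illustrative sketch

The poster lists two representation inconsistencies that duplicate detection has to absorb: acronyms used instead of full attribute names, and differing measurement units. The Python sketch below shows one way such a check could look on metadata records. It is not the MediCurator implementation (the prototype runs on Hazelcast), and every name in it is an assumption made for illustration: the record fields `description` and `thickness`, the lookup tables, the token-level Jaccard similarity, and the 0.8 threshold.

```python
import math
from itertools import combinations

# Hypothetical lookup tables; real metadata would need curated vocabularies.
ACRONYMS = {"mri": "magnetic resonance imaging", "ct": "computed tomography"}
UNIT_FACTORS_TO_MM = {"mm": 1.0, "cm": 10.0}


def normalize_text(value):
    """Lower-case a text attribute and expand known acronyms token by token."""
    return " ".join(ACRONYMS.get(tok, tok) for tok in value.lower().split())


def to_millimetres(value, unit):
    """Convert a length to millimetres so different units become comparable."""
    return value * UNIT_FACTORS_TO_MM[unit]


def jaccard(a, b):
    """Jaccard similarity between the token sets of two strings."""
    set_a, set_b = set(a.split()), set(b.split())
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def is_near_duplicate(r1, r2, threshold=0.8):
    """Two records match if their normalized text is similar enough and
    their lengths agree once expressed in the same unit."""
    text_sim = jaccard(normalize_text(r1["description"]),
                       normalize_text(r2["description"]))
    same_length = math.isclose(to_millimetres(*r1["thickness"]),
                               to_millimetres(*r2["thickness"]))
    return text_sim >= threshold and same_length


def find_near_duplicates(records, threshold=0.8):
    """Compare every pair of records and return the ids of matching pairs."""
    return [(r1["id"], r2["id"])
            for r1, r2 in combinations(records, 2)
            if is_near_duplicate(r1, r2, threshold)]


# Made-up example records, not taken from any archive.
RECORDS = [
    {"id": "a", "description": "Brain MRI axial", "thickness": (5.0, "mm")},
    {"id": "b", "description": "brain magnetic resonance imaging axial",
     "thickness": (0.5, "cm")},
    {"id": "c", "description": "Chest CT", "thickness": (5.0, "mm")},
]

if __name__ == "__main__":
    print(find_near_duplicates(RECORDS))  # [('a', 'b')]
```

The sketch compares all pairs, which is quadratic in the number of records. Similarity join algorithms such as those of Xiao et al. (2011) exist to avoid that exhaustive comparison, and a distributed execution framework would partition the work across nodes.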