Near Duplicate Detection for Medical Imaging Data Warehouse Construction

MediCurator poster presented at AMIA Joint Summits 2016. Discusses one of my recent works at Emory BMI.

Published in: Healthcare


Pradeeban Kathiravelu, Ashish Sharma
{pkathi2, ashish.sharma} @ emory.edu
Department of Biomedical Informatics, Emory University School of Medicine, Atlanta, GA.

Introduction

● Medical data warehouses and image archives are constructed by integrating multiple private and public data sources.
● Finding almost identical entries is crucial for warehouse construction.
● Medical image archives are huge and consist of structured and hierarchical data, which may be accessed by querying the metadata.
● Existing solutions tend to be too specific.
  – Master Patient Index (MPI) for patient records.
● Multiple dimensions and attributes, including medications, clinical, and pathological data, should be considered for complete duplicate detection and elimination.
● MediCurator is a near duplicate detection framework for heterogeneous medical data sources in constructing data warehouses.
● MediCurator has been developed to retrieve medical data from
  – various data sources, including MySQL, MongoDB, and CSV files, and
  – medical image archives such as TCIA.
● MediCurator fits as part of the ETL process.
  – Duplicates are detected in-memory.
  – Merged data is stored into data warehouses hosted in the Hadoop Distributed File System (HDFS).

MediCurator Approach

● Integrate medical data from various heterogeneous medical data sources and private archives using the public APIs.
● Curate the integrated data into a data warehouse for public access.
● Store the detected duplicate pairs into a separate data source.

Distributed Near Duplicate Detection

● Duplicate detection by analyzing the potential data pairs from the original data sources, using similarity matrices for textual data.
● Hierarchical metadata attached to the binary medical data is used to identify, classify, and find duplicates among the binary raw data.
● Considers the inconsistencies in representation.
  – Usage of acronyms instead of the full form of the attributes.
  – Using different measurement units.
● Hazelcast for distributed near duplicate detection.
● Metadata attached to the binary images in medical image archives such as The Cancer Imaging Archive (TCIA).

Design

● Data is published to various data sources by the medical data publishers, through the respective write APIs of the data sources.
● MediCurator connects to the original data sources through their read APIs.
● The output, consisting of the consolidated data and the duplicate pairs, is stored through the relevant write APIs.
● Medical data consumers consume the data from the warehouse composed by MediCurator through its read API.
● The data warehouse is considered to be free from duplicates, based on the effectiveness of the similarity matrices and similarity join algorithms used.
  – False positives and false negatives.

Implementation

● A prototype has been implemented.
  – Hazelcast as the distributed execution framework.
  – Distributed execution of research near duplicate detection algorithms on metadata.
  – Tenfold speed-up compared to existing solutions such as MPI systems.
● MediCurator functions as an integration middleware
  – for data warehouse construction
  – with duplicate detection and elimination
  – from the raw textual medical data, or from the binary data by leveraging the metadata attached to it.

References

● Xiao, C., Wang, W., Lin, X., Yu, J. X., & Wang, G. (2011). Efficient similarity joins for near-duplicate detection. ACM Transactions on Database Systems (TODS), 36(3), 15.
● Kathiravelu, P., Galhardas, H., & Veiga, L. (2015). ∂u∂u Multi-Tenanted Framework: Distributed Near Duplicate Detection for Big Data. On the Move to Meaningful Internet Systems: OTM 2015 Conferences, 237-256. Springer International Publishing.
● Kathiravelu, P., & Sharma, A. (2015). MEDIator: A Data Sharing Synchronization Platform for Heterogeneous Medical Image Archives. Workshop on Connected Health at Big Data Era (BigCHat'15), co-located with the 21st ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2015). ACM.
● Clark, K., Vendt, B., Smith, K., Freymann, J., Kirby, J., Koppel, P., Moore, S., Phillips, S., Maffitt, D., Pringle, M., Tarbox, L., & Prior, F. (2013). The Cancer Imaging Archive (TCIA): Maintaining and Operating a Public Information Repository. Journal of Digital Imaging, 26(6), 1045-1057.

Acknowledgments

* Google Summer of Code 2015
* NCI U01 [1U01CA187013-01], Resources for development and validation of Radiomic Analyses & Adaptive Therapy, Fred Prior, Ashish Sharma (UAMS, Emory)
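The near duplicate detection step described in the poster (comparing pairs of metadata records while tolerating acronyms and differing measurement units, then storing the consolidated data and the duplicate pairs separately) can be illustrated with a minimal single-process sketch. This is not MediCurator's code: the field names, the acronym and unit tables, the sample records, the choice of Jaccard similarity, and the 0.75 threshold are all assumptions made for illustration. The poster's prototype distributes this kind of work with Hazelcast, which the sketch does not attempt.

```python
from itertools import combinations

# Hypothetical lookup tables. The poster names the problem (acronyms, units)
# but does not list the mappings MediCurator actually uses.
ACRONYMS = {"ct": "computed tomography", "mr": "magnetic resonance"}
UNIT_FACTORS_TO_MM = {"mm": 1.0, "cm": 10.0}


def normalize(record):
    """Turn a metadata record into a set of 'key=value' tokens,
    expanding known acronyms and converting lengths to millimetres."""
    tokens = set()
    for key, value in record.items():
        if key == "id":
            continue  # identifiers differ across sources, so ignore them
        if isinstance(value, tuple):  # (magnitude, unit), e.g. (0.5, "cm")
            magnitude, unit = value
            value = f"{magnitude * UNIT_FACTORS_TO_MM[unit]:g}mm"
        else:
            text = str(value).strip().lower()
            value = ACRONYMS.get(text, text)
        tokens.add(f"{key}={value}")
    return tokens


def jaccard(a, b):
    """Jaccard similarity of two token sets (1.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def find_near_duplicates(records, threshold=0.75):
    """Compare every pair of records and return (id, id, score) tuples
    whose similarity reaches the threshold. Quadratic in the input size."""
    normalized = {r["id"]: normalize(r) for r in records}
    pairs = []
    for left, right in combinations(sorted(normalized), 2):
        score = jaccard(normalized[left], normalized[right])
        if score >= threshold:
            pairs.append((left, right, score))
    return pairs


def curate(records, threshold=0.75):
    """Return (warehouse_records, duplicate_pairs). For each detected pair,
    the record with the later id is left out of the warehouse."""
    pairs = find_near_duplicates(records, threshold)
    dropped = {right for _, right, _ in pairs}
    warehouse = [r for r in records if r["id"] not in dropped]
    return warehouse, pairs


# Made-up records from two imaginary sources describing the same scan.
SAMPLE_RECORDS = [
    {"id": "a1", "modality": "CT", "body_part": "chest",
     "slice_thickness": (5, "mm")},
    {"id": "b7", "modality": "Computed Tomography", "body_part": "Chest",
     "slice_thickness": (0.5, "cm")},
    {"id": "c3", "modality": "MR", "body_part": "brain",
     "slice_thickness": (1, "mm")},
]

if __name__ == "__main__":
    warehouse, duplicate_pairs = curate(SAMPLE_RECORDS)
    print("warehouse ids:", [r["id"] for r in warehouse])
    print("duplicate pairs:", duplicate_pairs)
```

With the sample records, `a1` and `b7` normalize to the same token set and are reported as a duplicate pair, while `c3` shares no tokens with either. The all-pairs loop is the part that a similarity join algorithm or a distributed execution framework would replace at scale.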
