Online Data Deduplication for In-Memory Big-Data Analytic Systems
1. 2020 – 2021
#13/ 19, 1st Floor, Municipal Colony, Kangayanellore Road, Gandhi Nagar, Vellore – 6.
Off: 0416-2247353 Mo: +91 9500218218 / +91 8220150373
Website: www.shakastech.com, Email - id: shakastech@gmail.com, info@shakastech.com
Abstract:
Given a set of files that exhibit a certain degree of similarity, we consider the novel problem of
eliminating data redundancy across a set of distributed worker nodes in a shared-nothing
in-memory big-data analytic system. The redundancy elimination scheme is designed to be
(i) space-efficient: the total space needed to store the files is minimized, and
(ii) access-isolated: data shuffling among servers is also minimized. In this paper, we first
show that finding a solution that is both access-efficient and space-optimal is an NP-hard problem.
We then present file partitioning algorithms that locate access-efficient solutions
incrementally and run in polynomial time. Our
experimental evaluation on multiple data sets confirms that the proposed file partitioning
solution achieves a compression ratio close to the optimal compression performance
of a centralized solution.
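
To make the two goals concrete, the following is a minimal sketch of one possible greedy, incremental file placement. It is not the algorithm from the paper, which the abstract does not specify. The fixed-size chunking, the SHA-256 fingerprints, and the names `chunk_fingerprints` and `assign_files` are all assumptions made for illustration. Each file is placed whole on a single node (so reading it needs no cross-node shuffling), and the node is chosen to maximize the number of chunks it already stores (so fewer new chunks are added).

```python
import hashlib
from typing import Dict, List, Set, Tuple


def chunk_fingerprints(data: bytes, chunk_size: int = 4) -> Set[str]:
    """Split data into fixed-size chunks and return the set of chunk hashes."""
    return {
        hashlib.sha256(data[i:i + chunk_size]).hexdigest()
        for i in range(0, len(data), chunk_size)
    }


def assign_files(
    files: Dict[str, bytes], num_nodes: int, chunk_size: int = 4
) -> Tuple[Dict[str, int], List[Set[str]]]:
    """Hypothetical greedy placement: one pass over the files, in arrival order.

    Each file goes to the node that already holds the most of its chunks;
    ties are broken in favor of the node storing the fewest chunks.
    """
    node_chunks: List[Set[str]] = [set() for _ in range(num_nodes)]
    placement: Dict[str, int] = {}
    for name, data in files.items():
        chunks = chunk_fingerprints(data, chunk_size)
        best = max(
            range(num_nodes),
            key=lambda n: (len(chunks & node_chunks[n]), -len(node_chunks[n])),
        )
        node_chunks[best] |= chunks  # only unseen chunks add to the node's storage
        placement[name] = best
    return placement, node_chunks


# Illustrative input: two pairs of similar files, two worker nodes.
files = {
    "a": b"AAAABBBBCCCC",
    "b": b"AAAABBBBDDDD",
    "c": b"XXXXYYYYZZZZ",
    "d": b"XXXXYYYYWWWW",
}
placement, node_chunks = assign_files(files, num_nodes=2)
stored = sum(len(c) for c in node_chunks)  # unique chunks kept across all nodes
```

In this toy input, similar files end up on the same node, and 8 unique chunks are stored instead of the 12 that would be kept without deduplication. The sketch ignores load balancing and content-defined chunking, both of which a practical system would likely need.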