Data de-duplication, in our case also called "Entity Matching", is not just about reducing multiple instances of the same item to one in order to save some space. It is a challenging task with many practical applications from health care to fraud prevention: "Is this person the same patient as ten years ago, but has moved in the meantime?", "Is this personalized spam mail the same email that was sent to many others, yet customized in each case?", or "Was there a spike resp. what is the base level of similar looking account creations?". In the age of Big Data, we do have the data to answer such questions, but heuristically comparing each item to all other items quickly becomes technically prohibitive for huge data sets.
In this session, we will have a look at past practises and current developments in the world of data de-duplication. After that, we will look at how to leverage locality-sensitive hashing algorithmically to reduce the amount of comparisons to a workable level. A demonstration will feature our implementation of that algorithm on top of Riak and Storm. The session will then finish with an overview of experiments and results using that system on different datasets, including browser fingerprints, tweets, and news articles.