Data De-duplication (Spring 2014)

Data Deduplication
for Language
Documentation
UNDER THE GUIDANCE OF:-
DR. JAN CHOMICKI AND DR. JEFF GOOD
PRESENTED BY:
KAUSHAL HAKANI, SHAIL PARIKH, SHASHANK RALLAPALLI

Outline
 Introduction
 Challenges
 Steps followed
 Algorithms used
 Approach
 Experimental Results
 Limitations
 Conclusions

Introduction
 13 Villages
 7-9 “languages” spoken
 4 local isolates
 2 dialect clusters
 12000 people
 Localist attitudes
 Various classes of people collecting data

Aim
 Detect duplicate files in the data obtained by the researchers in
Cameroon.
 Decide which files to keep and which to remove.
 Remove duplicate files (De-duplicate)
 Maintain information about provenance of the deleted data.

Dataset (continued)
 Initial observations about the dataset reveal that it contains the
following types of files:
 Audio/Visual
 Audio recordings
 Video recordings
 Photographs/Scanned images
 Textual
 Transcriptions (some time-aligned, XML)
 Questionnaire data
 Lexical data (e.g., vocabulary items in a database)

Dataset (continued)
 Metadata
 Contains information about the actual data files
 System-generated files
 Files generated by macOS (.DS_Store)
We observed approximately 231 unique file extensions when we parsed the
dataset.

Challenges
 Lack of a standard naming convention.
 Decide a suitable basis for de-duplication
 File name based or file content based
 Decide a suitable criterion for making this choice
 Get sample data to run different de-duplication techniques

Challenges (continued)
 Decide what de-duplication methods would be required
 Edit Distance
 Jaccard Similarity
 Checksum and examination of the data within each file
 We also faced a few other challenges
 Come up with appropriate factors to decide what files to delete from
the dataset
 Moving files over different filesystems.
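The checksum option listed above can be sketched as follows. The slides do not say which checksum was used, so SHA-256 and the helper names `file_checksum` and `same_content` are assumptions made only for illustration.

```python
import hashlib
from pathlib import Path


def file_checksum(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()  # assumption: the slides do not name a hash
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_content(path_a, path_b):
    """Report whether two files hold identical bytes."""
    # Cheap size check first; hash only when the sizes match.
    if Path(path_a).stat().st_size != Path(path_b).stat().st_size:
        return False
    return file_checksum(path_a) == file_checksum(path_b)
```

Reading in chunks keeps memory use flat, which matters for large audio and video recordings.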

Steps
Initial Filtering
•Group by File Size
•Sampling
Sampled Data
•De-duplicate on file name?
•De-duplicate on file content?
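A minimal sketch of the "Group by File Size" filtering step, assuming the dataset is an ordinary directory tree. The function name `group_by_size` is hypothetical and not taken from the project.

```python
import os
from collections import defaultdict


def group_by_size(root):
    """Map file size in bytes to the paths under root that have that size.

    Only sizes shared by two or more files are returned, since a file with
    a unique size cannot have a byte-identical duplicate.
    """
    groups = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full_path = os.path.join(dirpath, name)
            try:
                groups[os.path.getsize(full_path)].append(full_path)
            except OSError:
                continue  # skip files that cannot be read
    return {size: paths for size, paths in groups.items() if len(paths) > 1}
```

Each returned group is a candidate set that the name-based or content-based checks can then examine.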

Steps
Experimental Observation
•De-duplicate based on file name
•Decide the de-duplication techniques to be used
Implementation
•Edit Distance
•Jaccard Similarity
•Custom Methods

Steps
Test sample data
•Results were satisfactory
•Also got data to compare results against
Ran on Actual Data
•Could potentially remove 384.41 GB out of a total of 928.45 GB. That is
about 41.4% of the data.

Algorithms
 We used the following standard de-duplication algorithms
 Edit-Distance
 Jaccard Similarity (using n-grams)
 We also used specialized algorithms
 Copy removal (specific to the dataset)
 Bus removal (again, a special method)

Edit-Distance
 This algorithm gives the dissimilarity between two strings.
 It calculates the cost of converting a given string to the other one.
 The cost of each insert, delete and replacement is set to 1.
 For example:
String s1 = “Mail Juice-21.gif”
String s2 = “Mail Juice-18.gif”

Example
String1 = “Mail Juice-21.gif”
String2 = “Mail Juice-18.gif”
 Set the cost of insert = 1, delete = 1 and replacement = 1.
 Total cost of converting String1 to String2 is 2 (two replacements:
“2” → “1” and “1” → “8”).
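The calculation above can be reproduced with a standard dynamic-programming (Levenshtein) sketch using unit costs. This is an illustration rather than the project's code, and the function name `edit_distance` is an assumption.

```python
def edit_distance(source, target):
    """Levenshtein distance with insert, delete and replacement cost 1."""
    # previous[j] holds the cost of converting the processed prefix of
    # source into the first j characters of target.
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]  # converting i characters into an empty string
        for j, t_char in enumerate(target, start=1):
            delete = previous[j] + 1
            insert = current[j - 1] + 1
            replace = previous[j - 1] + (s_char != t_char)
            current.append(min(delete, insert, replace))
        previous = current
    return previous[-1]


print(edit_distance("Mail Juice-21.gif", "Mail Juice-18.gif"))  # 2
```

Only two rows of the cost table are kept at a time, so memory grows with the length of the target string alone.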

Jaccard Coefficient
 This algorithm measures the similarity of two strings.
 It splits each string into grams of length k, where k is a chosen
parameter.
 Then it calculates how many grams of one string are contained in the
list of grams of the other string
 Jaccard Coefficient = |S1 ∩ S2| / |S1 ∪ S2|
where S1 and S2 are the sets of grams of the two strings.

Example
String1 = MailJuice21
String2 = MailJuice18
Grams (k = 3, with “_” marking a separator):
String1 = [Mai, ail, il_, l_J, _Ju, Jui, uic, ice, ce_, e_2, _21, 21_]
String2 = [Mai, ail, il_, l_J, _Ju, Jui, uic, ice, ce_, e_1, _18, 18_]
|S1 ∪ S2| = 15
|S1 ∩ S2| = 9
Jaccard Coefficient = 9 / 15 = 0.6, i.e. the two names are 60% similar.
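A short sketch of the same computation on character 3-grams. The function names `ngrams` and `jaccard` are assumptions, and the input strings are written with "_" separators so that they produce the grams listed above.

```python
def ngrams(text, k=3):
    """Return the set of all length-k substrings of text."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def jaccard(a, b, k=3):
    """Jaccard coefficient of the k-gram sets of two strings."""
    grams_a, grams_b = ngrams(a, k), ngrams(b, k)
    union = grams_a | grams_b
    if not union:
        # Both strings are shorter than k; fall back to exact comparison.
        return 1.0 if a == b else 0.0
    return len(grams_a & grams_b) / len(union)


score = jaccard("Mail_Juice_21_", "Mail_Juice_18_")
print(round(score, 2))  # 0.6
```

Because the grams are stored as sets, a gram that repeats within one name is counted once.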

Custom Methods
 There were certain cases where the files were duplicates but the names
were not the same.
 For example
FILE NAME              FILE SIZE
FOO50407.JPG           1.7 MB
FOO50407 (COPY).WAV    1.7 MB
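One way such a copy check might look is sketched below. The slides do not describe the actual rule, so everything here is an assumption: the "(COPY)" marker pattern, the choice to ignore the extension (the two names in the example have different extensions), and the names `normalized_stem` and `is_copy_pair`.

```python
import re
from pathlib import PurePath

# Assumption: copies are marked by a trailing "(COPY)" in the file stem.
COPY_MARKER = re.compile(r"\s*\(copy\)\s*$", re.IGNORECASE)


def normalized_stem(file_name):
    """Drop the extension and any trailing "(COPY)" marker, then lowercase."""
    stem = PurePath(file_name).stem
    return COPY_MARKER.sub("", stem).lower()


def is_copy_pair(name_a, size_a, name_b, size_b):
    """Treat two files as copies if their sizes and normalized stems match."""
    return size_a == size_b and normalized_stem(name_a) == normalized_stem(name_b)
```

A match from a rule like this is only a candidate; a content check would still be needed before deleting anything.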

Experimental Results
On sample data (deleted file size as a share of total deleted file size):
WAV: 98%
Others: 2%

Experimental Results
On total data (deleted file size as a share of total deleted file size):
WAV: 94%
Others: 6%

Generated Log File
The column names, from left to right, are: new file name, old file name, old
directory, size and timestamp.
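A provenance log with these five columns could be written as in the sketch below. The CSV format, the column identifiers in `LOG_COLUMNS`, and the function name `write_log` are assumptions; the slides only give the column order.

```python
import csv

# Assumption: identifiers chosen to mirror the column order on the slide.
LOG_COLUMNS = ["new_file_name", "old_file_name", "old_directory", "size", "timestamp"]


def write_log(rows, stream):
    """Write one log line per removed file to an open text stream.

    Each row is a dict keyed by LOG_COLUMNS.
    """
    writer = csv.DictWriter(stream, fieldnames=LOG_COLUMNS)
    writer.writeheader()
    for row in rows:
        writer.writerow(row)
```

Recording the old name and directory next to the retained file name is what preserves the provenance of deleted data.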

Limitations
 We observed a few limitations in the system we built.
 Our system does not recognize different date formats within a file name
as equivalent; it treats each format as a different string.
 Example: 25-05-2008 and 2008-25-5 are treated as different
 Our system also does not recognize abbreviations
 Example: MK is not treated as similar to MunKen
So, human review is still required to completely de-duplicate the
data when it is ingested without a consistent structure.

Conclusion
 Data de-duplication is a job-specific or, to be precise,
application-specific task.
 Under the given specifications and our implemented logic, our methods
succeeded in de-duplicating a large amount of data, freeing almost
400 GB of the given 1 TB hard drive.
