Near Duplicate Document Detection: Mathematical Modeling and Algorithms
Liwei Ren
Trend Micro
10101 North De Anza Boulevard
Cupertino, CA 95014, USA
1-408-850-1048
liwei_ren@trendmicro.com
Qiuer Xu
Trend Micro
Building B, Soho International Plaza
Nanjing, 210012, P.R. China
86-25-52386123
fallson_xu@trendmicro.com.cn
ABSTRACT
Near-duplicate document detection is a well-known problem in the area of information retrieval, and an important one to solve for many applications in the IT industry. It has been studied extensively in the research literature. This article provides a novel solution to this classic problem. We present the problem with abstract models along with additional concepts such as text models, document fingerprints and document similarity. With these concepts, the problem can be transformed into a keyword-like search problem with results ranked by document similarity. Two major techniques are involved: the first is to extract robust and unique fingerprints from a document; the second is to calculate document similarity efficiently. Algorithms for both fingerprint extraction and document similarity calculation are introduced as a complete solution.
Categories and Subject Descriptors
H.3.3: Information Search and Retrieval – information filtering, retrieval models, search process.
General Terms
Algorithms, Experimentation.
Keywords
Duplicate Document, Near Duplicate Detection, Document Fingerprint, Document Similarity, Retrieval Model, Information Retrieval, Asymmetric Architecture
1. INTRODUCTION
Near duplicate document detection (NDDD) is a well-known problem in the area of information retrieval. It is defined as identifying whether a given document is a near duplicate of one or more documents from a well-defined document set. This problem can be found in many technical areas such as crawling and indexing optimization of web search engines, copy detection systems, email archival, spam filtering, and data leak prevention systems. There is a substantial body of research literature discussing this subject with numerous use cases and solutions [1-6]. Recently, Kumar et al. [7] provided a thorough review of the most significant works of recent decades, covering more than 60 papers.
We organize the following sections in the fashion of problem definition, mathematical modeling and algorithmic solutions. We introduce a formal problem definition, followed by three text models that are used to represent documents. One text model is selected for constructing the algorithmic solution. By introducing concepts such as document fingerprint and document similarity, the problem can be decomposed into three independent problems: (a) document fingerprint extraction; (b) document similarity calculation; (c) a fingerprint-based search engine. Two algorithms are constructed to extract fingerprints from documents and to measure the similarity between documents. A keyword-based search engine can be reused to solve problem (c). Finally, an architecture of asymmetric fingerprint generation is proposed to reduce the number of fingerprints. A small number of fingerprints is critical for the success of some special applications such as data leak prevention systems.
2. PROBLEM DEFINITION AND MODELING
The problem proposed in the introduction section is not well-defined from the perspective of practical implementation. In practice, we need a quantitative measurement of how "near duplicated" two documents are, so we need a more rigorous definition of NDDD.
Definition 1: Assume that we have a set of documents S. For any given document d and a percentile X%, one needs to identify multiple documents D1, D2, …, Dm from S such that SIM(d, Dj) ≥ X% for 1 ≤ j ≤ m, where SIM is a well-defined function that calculates the similarity of two documents. The results {D1, D2, …, Dm} are shown in descending order of the percentiles.
There are several challenges in solving this problem:
(a) The document set may be huge, on a scale of millions or even billions of documents. One certainly cannot compare d with each document of S to calculate the similarity. How can we efficiently identify the reference document D from a huge document set?
(b) How do we construct the similarity function SIM?
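To appreciate challenge (a), it helps to spell out the naive baseline it rules out: comparing d against every document of S. The following sketch is our own illustration in Python, not part of this article's solution; sim stands in for the SIM function constructed later.

# Naive baseline for Definition 1: one full comparison per document in S.
# Infeasible when S holds millions or billions of documents.
def naive_nddd(d, S, sim, threshold):
    # Score every document, keep those at or above the threshold,
    # and rank the survivors by similarity, best first.
    scored = [(D, sim(d, D)) for D in S]
    hits = [(D, s) for (D, s) in scored if s >= threshold]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)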
Before we are able to answer these questions, we need to propose text models to represent a document. A text model allows us to exclude irrelevant textual elements so that we can focus on the essence.

Documents can be in any document format such as Word, PowerPoint, Excel, PDF, PostScript and many others. The individual words or sentences can be in different styles (bold, italic, underline) and with a variety of fonts. These are not important textual elements when we discuss "near duplicate". Fundamentally, we are more interested in the textual content that carries semantic significance.
A document can be written in any language. Texts in different languages can be encoded differently; for example, English texts can be encoded in ASCII, Chinese in GB, and Japanese in SJIS. However, all languages can be encoded in the UTF-8 standard, which is able to represent all languages in one text.

For documents in English or any western language, most authors view a text as a string of words [2-6]. Words can be extracted from texts with tokenization techniques that use spaces to separate words (or tokens) in sentences.
Some languages such as Chinese and Japanese do not use spaces between words. In those eastern languages, a sentence is a string of characters without spaces between them. Since the characters of all languages can be encoded as UTF-8 characters, a text in any language can be considered a string of UTF-8 characters.

Depending on the language, each UTF-8 character consists of one or more bytes; for example, a Chinese character typically consists of three bytes while an ASCII character is one byte. Therefore, one can also view a text as a string of bytes once it is converted from its original encoding into UTF-8.
Definition 2: We have three text models to represent a document:
Model 1: A text is a string of tokens (or a sequence of tokens).
Model 2: A text is a string of UTF-8 characters.
Model 3: A text is a string of bytes when the text is encoded in UTF-8.
In summary, a text is a string of basic textual units, where a basic textual unit is a token, a UTF-8 character, or a byte.
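As a small illustration of our own (not from the article), the three models yield three different views of the same text; note how each Chinese character counts once under model 2 but three bytes under model 3, and how whitespace tokenization leaves the Chinese words unsplit, hinting at the tokenization difficulty discussed below.

# One text, three views (Python; illustrative only).
text = "near duplicate 检测"
tokens = text.split()          # model 1: tokens via whitespace tokenization
chars = list(text)             # model 2: UTF-8 characters (code points)
raw = text.encode("utf-8")     # model 3: bytes of the UTF-8 encoding
print(tokens)                  # ['near', 'duplicate', '检测']
print(len(chars))              # 17 characters
print(len(raw))                # 21 bytes: each Chinese character is 3 bytes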
Besides these three models, there exist other text models whose basic textual units are sentences [5], textual lines, or even pages. Those models are not of interest in this article.

Numerous articles study NDDD using text model 1. While this model is good enough for NDDD on documents in western languages, it runs into obstacles with non-western languages: model 1 requires tokenization, and tokenization is a daunting task, especially for documents in Chinese and Japanese.
There are few works adopting text models 2 and 3 in the academic world. Manber [1] discussed duplicate detection in terms of pairwise matching of ASCII files, which is a special case of models 2 and 3. In contrast, it has become common practice in industry to apply text model 2 or 3 to many document management problems such as DLP [8-10], spam filtering and e-Discovery. In this article, we use text model 2 to extract fingerprints from documents and to calculate the similarity between two documents. Both text models 2 and 3 are language independent while model 1 is not. Therefore, the techniques developed in this article apply equally to documents in any language, and even to a document written in multiple languages.
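The article's SIM algorithm is constructed later; purely for illustration, here is one plausible similarity function over text model 2, the Jaccard similarity of character n-gram sets. This is our own assumption for the sketches in this section, not the article's definition of SIM.

# Illustrative SIM under text model 2: Jaccard similarity of the sets
# of n-grams of UTF-8 characters. Returns a value in [0, 1].
def sim(a, b, n=4):
    grams = lambda s: {s[i:i + n] for i in range(len(s) - n + 1)}
    ga, gb = grams(a), grams(b)
    if not ga and not gb:
        return 1.0               # two empty texts are identical
    return len(ga & gb) / len(ga | gb)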
Definition 3: Document normalization is a process that consists of three sub-processes applied sequentially:
(a) Converting a document in any format, such as Word, Excel and PDF, into a plain text encoded in UTF-8;
(b) Converting any plain text in other encodings into a plain text encoded in UTF-8;
(c) Removing trivial characters such as white spaces, delimiters, and control characters from the UTF-8 text.
Definition 4: The result of document normalization is a string of UTF-8 characters that contains the most significant information of the original document. It is called a normalized text or normalized document.
Many software tools are available for document normalization. Without loss of generality, we consider all documents as normalized texts in the rest of this article unless we specify otherwise.
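As one illustration of Definition 3, the Python sketch below covers sub-processes (b) and (c) only; sub-process (a), format conversion, requires a format-specific extraction tool of the kind just mentioned and is assumed to have happened already. Exactly which characters count as "trivial" is a policy choice; here whitespace, control characters and punctuation are dropped.

```python
import unicodedata

def normalize(raw: bytes, encoding: str = "utf-8") -> str:
    """A sketch of sub-processes (b) and (c) of Definition 3."""
    # (b) decode from the source encoding into a Unicode string,
    # which re-encodes losslessly as UTF-8 on output.
    text = raw.decode(encoding, errors="ignore")
    # (c) drop trivial characters: whitespace, control/format characters
    # and punctuation carry no fingerprintable content.
    kept = []
    for ch in text:
        cat = unicodedata.category(ch)
        if ch.isspace() or cat.startswith("C") or cat.startswith("P"):
            continue
        kept.append(ch)
    return "".join(kept)
```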
With these preliminaries in place, it is time to tackle the two challenges of Definition 1. To meet the first challenge, let us introduce the concept of a document fingerprint.
Definition 5: A document fingerprint is an integer or a binary string of fixed length. Fingerprints are generated from documents by a function GEN and have the following characteristics:
(a) A document D has multiple fingerprints { F1, F2, …, Fn }, i.e., GEN(D) = { F1, F2, …, Fn }.
(b) Two irrelevant documents d and D do not have a common fingerprint, that is, GEN(d) ∩ GEN(D) = ϕ. This is called uniqueness.
(c) A fingerprint can survive moderate document changes, that is, GEN(d) ∩ GEN(D) ≠ ϕ if d is a near-duplicate copy of D. This is called robustness.
(d) In summary, a fingerprint is a unique invariant of document variants.
A document D can thus be represented by multiple fingerprints; let us denote this relationship as D ↔ { F1, F2, …, Fn }. For any document D from the document set S in Definition 1, we can assign a unique document ID, establishing a mapping between the ID and the fingerprints, denoted ID ↔ { F1, F2, …, Fn }. This is reminiscent of the keyword-based searching problem: we can index the relationship ID ↔ { F1, F2, …, Fn } into index files, treating the fingerprints as keywords. We can therefore restate the NDDD problem of Definition 1 with the following model, supported by two procedures, an indexer and a searcher.
NDDD Model: Assume we have two functions: (a) a fingerprint generation function GEN; (b) a document similarity measurement function SIM. The NDDD problem is then reduced to a fingerprint-based indexing and searching problem:
Indexer: Given a set of documents S, each document is assigned a unique ID. We extract multiple fingerprints { F1, F2, …, Fn } from each document D with the function GEN. The indexer indexes them together with the document ID, i.e., ID ↔ { F1, F2, …, Fn }. The indexing results are saved into index files.
Searcher: For any query document d and the percentile X%, we extract multiple fingerprints { f1, f2, …, fn } from d with the function GEN. The searcher uses them to retrieve relevant document IDs from the index files: if a reference document contains any of { f1, f2, …, fn }, its ID is retrieved, and with the ID the reference document D is retrieved as a result. Then we calculate SIM(d, D) to measure the similarity. There may be multiple reference documents retrieved; we calculate the similarity for all of them and rank the results in descending order of similarity.
With the model above, the NDDD problem is decomposed into three independent sub-problems.
Three Sub-Problems:
1. Fingerprint generation --- Generate multiple
fingerprints from a given document D by a fingerprint
generation function GEN(D).
2. Similarity measurement --- Calculate the similarity
between two documents d and D by the similarity
function SIM(d,D).
3. Indexing/Searching --- The indexer indexes each document ID with its fingerprints { F1, F2, …, Fn }. The searcher retrieves document IDs from the indices given the query fingerprints { f1, f2, …, fn }. This is similar to a keyword-based search engine such as Google or Lucene.
One can use a general search engine framework or even a relational database system to solve the third problem. Therefore, we propose algorithmic solutions only for the first and second problems.
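To make the model concrete, the following toy sketch (ours, not a prescription) shows the indexer/searcher pair as an in-memory inverted index from fingerprints to document IDs; GEN and SIM are assumed given, and a production system would delegate this role to a search engine as noted above.

```python
from collections import defaultdict

class FingerprintIndex:
    """Toy in-memory stand-in for the indexer/searcher pair of the NDDD model."""
    def __init__(self, gen, sim):
        self.gen, self.sim = gen, sim      # the GEN and SIM functions
        self.postings = defaultdict(set)   # fingerprint -> set of document IDs
        self.docs = {}                     # document ID -> normalized text

    def index(self, doc_id, text):
        """Indexer: store ID <-> {F1, ..., Fn}."""
        self.docs[doc_id] = text
        for fp in self.gen(text):
            self.postings[fp].add(doc_id)

    def search(self, query, threshold):
        """Searcher: any document sharing a fingerprint is a candidate;
        SIM then scores candidates and X% (threshold) filters them."""
        candidates = set()
        for fp in self.gen(query):
            candidates |= self.postings.get(fp, set())
        scored = [(self.sim(query, self.docs[d]), d) for d in candidates]
        return sorted((s, d) for s, d in scored if s >= threshold)[::-1]
```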
3. Algorithms
This section provides algorithms to construct the two functions
GEN and SIM respectively.
The function GEN extracts fingerprints from a given normalized document. A fingerprint is an invariant of the text that can survive document changes. What can survive changes? Changes of text are caused by editing operations such as insertion, deletion and copy/paste. However, many pieces of the original text remain in the new text; these unchanged pieces merely shift position within the text. If we can identify some of these unchanged pieces, we can use them as text invariants to generate fingerprints. How do we locate these unchanged yet shifting pieces?
First, we use text model 2 to represent a text as a string of UTF-8 characters, denoted T = c1 c2 … cL where L is the string length. Hence we can discuss strings of characters instead of texts or documents. Second, we introduce the concept of "anchoring points", which is briefly discussed in [1] without implementation suggestions. An anchoring point is a character in the string that remains the same relative to its neighborhood when the string changes. One can apply a good hash function H to the neighborhood around an anchoring point to generate a fingerprint; with multiple anchoring points, we obtain multiple fingerprints for the document. Two issues must be solved. The first is how to select robust anchoring points, since the string can change. The second is that there may be too many anchoring points, so that we generate too many fingerprints from a given string. We propose Algorithm 1 to construct the function GEN in a way that handles both issues.
Definition 6: We need some notation for Algorithm 1:
The alphabet A of UTF-8 characters appearing in the string.
Two numbers M and N that select the most robust anchoring points for generating fingerprints. M can be fixed for any text string while N is selected according to the string size; Table 1 shows an example of how M and N are configured.
The width W of anchoring neighborhoods.
A hash function H that generates a fingerprint from a sub-string of size W. There is no specific requirement for the hash function.
A character score function. For a character whose occurrence offsets in the string are P1 < P2 < … < Pn, the score is defined as

    score = n · (Pn − P1) / Σ_{1≤i<n} (P_{i+1} − P_i)²
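As a small illustration, the score can be computed directly from the sorted occurrence offsets. The sketch below assumes the fraction form given above and returns 0 for characters occurring only once:

```python
def char_score(offsets):
    """Score of a character with sorted occurrence offsets P1 < ... < Pn:
    frequent characters spread evenly across the string score highest."""
    n = len(offsets)
    if n < 2:
        return 0.0
    span = offsets[-1] - offsets[0]                  # Pn - P1
    gaps_sq = sum((offsets[i + 1] - offsets[i]) ** 2 for i in range(n - 1))
    return n * span / gaps_sq
```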
Table 1: Configuration of M and N by text size
Text Size Range    M    N
0-10K              4    128
10-20K             4    256
20-30K             4    256
30-50K             4    512
50-70K             4    1024
70-80K             4    1024
80-100K            4    1024
100-500K           4    1024
> 500K             4    1024
Algorithm 1:
Input: String T as c1 c2 … cL
Output: Fingerprint set.
Procedure:
Step 1: Select the number N from Table 1 according to the string length L.
Step 2: Run through the string T, counting the occurrences of each unique UTF-8 character in A and saving their offsets.
Step 3: Each character C ∈ A has one or multiple occurrences in T, with offsets denoted P1, P2, … Pn. Use the score function to calculate the score of C.
Step 4: Pick the M characters from A with the highest scores; call them B = { C1, C2, … CM }.
Step 5: For each C ∈ B, do steps 6 to 9.
Step 6: Each occurrence of C in T defines an anchoring neighborhood with C at its center. Each neighborhood is a sub-string of size W. Denote these neighborhoods S1, S2, … Sn, corresponding to the occurrence offsets P1, P2, … Pn.
Step 7: Sort the list of sub-strings S1, S2, … Sn. Without loss of generality, we still denote the sorted list as S1, S2, … Sn.
Step 8: Select the first K items from the sorted list, where K = MIN(N, n): { S1, S2, … SK }.
Step 9: Apply the hash function H to { S1, S2, … SK } to generate K fingerprints and add them to the fingerprint set.
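The following Python sketch traces the steps of Algorithm 1. The neighborhood width W = 16 and the SHA-1 hash are illustrative choices (the paper leaves W and H open), char_score is the score function sketched after Definition 6, and N is passed in rather than looked up from Table 1:

```python
import hashlib
from collections import defaultdict

def gen(text, M=4, N=128, W=16):
    """A sketch of Algorithm 1 (symmetric fingerprint generation)."""
    # Steps 2-3: collect the offsets of each character and score it.
    offsets = defaultdict(list)
    for i, ch in enumerate(text):
        offsets[ch].append(i)
    scores = {ch: char_score(pos) for ch, pos in offsets.items()}
    # Step 4: the M highest-scoring characters are the anchor characters.
    anchors = sorted(scores, key=scores.get, reverse=True)[:M]
    fingerprints = set()
    half = W // 2
    for ch in anchors:
        # Step 6: one width-W neighborhood centered on each occurrence
        # (neighborhoods near the ends of the string are clipped).
        subs = [text[max(0, p - half):p - half + W] for p in offsets[ch]]
        # Steps 7-8: sorting makes the top-K selection deterministic under
        # shifts; keep at most N neighborhoods per anchor character.
        subs.sort()
        for s in subs[:min(N, len(subs))]:
            # Step 9: hash each selected neighborhood into a fingerprint.
            fingerprints.add(hashlib.sha1(s.encode("utf-8")).hexdigest())
    return fingerprints
```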
The algorithm is stated in terms of text model 2; however, it works for the other two models as well by replacing "character" with "token" or "byte". The idea of the algorithm is straightforward. First, it selects the most significant characters from the alphabet of the input string, using the score function to measure significance; the score of a character reflects both its frequency and its distribution across the string. Second, for each picked character, it chooses robust anchoring points by sorting the neighborhoods and picking the top items from the list; sorting is a mechanism that turns randomness into order. The result is a set of at most M*N fingerprints. For example, when the normalized text size is less than 10KB, which is typical in the real world, we get at most 4*128 = 512 fingerprints.
The function SIM calculates the similarity between two normalized documents. We use text model 2 to represent a document, so we actually compare two strings of characters. What does similarity mean for strings? If there are common sub-strings between two strings and their total length is long enough, we consider the strings similar to each other. We also expect the similarity to be measured as a percentage. We propose Algorithm 2 to calculate the similarities between one given document and a set of reference documents. The main idea is to identify common sub-strings with a hash-based greedy matching strategy.
Definition 7: We need some notation for Algorithm 2:
A number M that defines the minimum length of common sub-strings. Common sub-strings must have a minimum length to avoid triviality; otherwise a single character could count as a common sub-string.
A hash function H that generates a hash value from a sub-string of size M. The hash table has chaining capability to resolve collisions. There is no specific requirement for the hash function; however, due to the nature of the algorithm, a rolling hash function is recommended for good performance (a minimal sketch follows this definition).
A hash table HT.
For a string T, a substring is denoted T[s, …, e] where s and e are the starting and ending offsets.
The algorithm is stated with text model 2; however, it can be applied to the other two models as well.
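Since the definition recommends a rolling hash without prescribing one, here is a minimal Rabin-Karp style polynomial rolling hash as one possible choice; the base and modulus are illustrative:

```python
def rolling_hashes(s, M, base=257, mod=(1 << 61) - 1):
    """Hashes of every length-M window of s in O(len(s)) total time,
    updating each successive window in O(1)."""
    if len(s) < M:
        return []
    h = 0
    for ch in s[:M]:
        h = (h * base + ord(ch)) % mod
    hashes = [h]
    top = pow(base, M - 1, mod)   # weight of the outgoing character
    for i in range(M, len(s)):
        h = ((h - ord(s[i - M]) * top) * base + ord(s[i])) % mod
        hashes.append(h)
    return hashes
```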
Algorithm 2:
Input: Query string d and multiple reference strings { D1, D2, …, Dm }
Output: The similarities { SIM1, SIM2, …, SIMm }
Procedure:
Step 1: Create the hash table HT, sized according to L, the length of the query string d.
Step 2: For j = 0 to L−M:
Apply the hash function H to the sub-string d[j, …, j+M−1] of d to calculate the hash value h.
Store the offset j in HT[h] or its chained linked list.
Step 3: For each k in {1, 2, …, m}, do steps 4 to 12.
Step 4: Let Lk be the length of Dk; set P = 0 and SUM = 0.
Step 5: Let h = H(Dk[P, …, P+M−1]).
Step 6: If HT[h] is empty, there is no matching sub-string at offset P; let P = P+1 and go to step 11.
Step 7: For each sub-string offset s stored in the chained linked list at HT[h], do step 8.
Step 8: If d[s, …, s+M−1] ≠ Dk[P, …, P+M−1], set V(s) = 0; otherwise, extend the two equal sub-strings forward with as many common characters as possible, arriving at the maximum common sub-string size V(s).
Step 9: Let V be the largest of all V(s) obtained in step 8.
Step 10: If V > 0, let SUM = SUM + V and P = P + V; otherwise let P = P + 1.
Step 11: If P ≤ Lk − M, go to step 5.
Step 12: Let SIMk = SUM / Lk.
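A Python sketch of Algorithm 2 follows, reusing rolling_hashes from the sketch above for step 2. For simplicity, poly_hash recomputes each window hash in step 5; rolling the hash over Dk as well would be faster, per the recommendation in Definition 7. M = 8 is an illustrative minimum match length:

```python
from collections import defaultdict

def poly_hash(s, base=257, mod=(1 << 61) - 1):
    """Hash of one window; consistent with rolling_hashes above."""
    h = 0
    for ch in s:
        h = (h * base + ord(ch)) % mod
    return h

def sim_all(d, refs, M=8):
    """A sketch of Algorithm 2: similarity of query string d against each
    reference string via hash-based greedy matching of common sub-strings."""
    # Steps 1-2: index every length-M window of d by its hash; the list of
    # offsets per hash is the chaining that absorbs collisions.
    table = defaultdict(list)
    for j, h in enumerate(rolling_hashes(d, M)):
        table[h].append(j)
    sims = []
    for D in refs:
        total, p = 0, 0
        while p <= len(D) - M:                    # steps 4-11 per reference
            best = 0
            for s in table.get(poly_hash(D[p:p + M]), []):
                if d[s:s + M] != D[p:p + M]:      # step 8: reject collisions
                    continue
                v = M                             # extend the verified match
                while s + v < len(d) and p + v < len(D) and d[s + v] == D[p + v]:
                    v += 1
                best = max(best, v)
            if best:                              # step 10: skip matched block
                total, p = total + best, p + best
            else:
                p += 1
        sims.append(total / len(D) if D else 0.0)  # step 12
    return sims
```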
Algorithm 2 calculates all of SIM(d, D1), SIM(d, D2), …, SIM(d, Dm) in one construction: steps 1 and 2 pre-process d, and steps 4 to 12 calculate the individual SIM(d, Dk) one at a time.
For a normalized query document d and reference document D, Algorithm 2 identifies a set of common sub-strings and sums up their lengths as SUM. The similarity SIM is then measured by SUM / Length(D). One may ask why we do not include the length of d in the similarity. This is because we care more about how much of D is duplicated in the query document d than about how much of d consists of content from D. One can certainly design another formula that calculates the similarity from SUM and both lengths. Finally, we need to make sure that SIM measures similarity meaningfully. This is guaranteed by the following theorem.
Theorem 1: The function SIM defined by Algorithm 2 satisfies the following properties for two normalized documents d and D:
1. 0 ≤ SIM(d,D) ≤ 1.
2. If d and D are the same document, SIM(d,D) = 1.
3. If d and D have no common sub-strings at all, SIM(d,D) = 0.
Proof: From steps 4 to 11 of Algorithm 2, we have 0 ≤ SUM ≤ Length(D), which proves 0 ≤ SIM(d,D) ≤ 1. If d = D, it is not difficult to show that SUM = Length(D), i.e., SIM(d,D) = 1. The last assertion is trivial.
4. Asymmetric Fingerprint Generation
For some special applications, such as DLP (data loss prevention) endpoint products, index fingerprint files created on servers must be delivered to remote machines that host searchers. It is then necessary to use fewer fingerprints to represent a document in order to save network bandwidth and cost. In Algorithm 1 there are two important parameters for generating fingerprints: the numbers M and N, where M is fixed and N is configured according to the text size by a table.
Based on our experiments, we can reduce the number of fingerprints while keeping almost the same recall rate if we apply a smaller N to the function GEN on the indexer side while keeping N on the searcher side unchanged. In other words, we can solve the NDDD problem even if the indexer generates far fewer fingerprints than the searcher. Table 2 gives an example of defining different N's for the indexer and the searcher.
Table 2: Different N for Indexer and Searcher
Text Size Range    M    N for Indexer    N for Searcher
0-10K              4    8                128
10-20K             4    16               256
20-30K             4    32               256
30-50K             4    32               512
50-70K             4    64               1024
70-80K             4    128              1024
80-100K            4    256              1024
100-500K           4    512              1024
> 500K             4    1024             1024
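Operationally, asymmetry only means calling GEN with a different N on each side. A minimal sketch, with Table 2 encoded as a lookup (the names and structure here are ours):

```python
# Table 2 as a lookup: (upper size bound, N for indexer, N for searcher);
# M stays fixed at 4 throughout.
ASYMMETRIC_N = [
    (10_000, 8, 128), (20_000, 16, 256), (30_000, 32, 256),
    (50_000, 32, 512), (70_000, 64, 1024), (80_000, 128, 1024),
    (100_000, 256, 1024), (500_000, 512, 1024), (float("inf"), 1024, 1024),
]

def pick_n(length, side):
    """Return N for a normalized text length and side ('indexer'/'searcher')."""
    for bound, n_indexer, n_searcher in ASYMMETRIC_N:
        if length <= bound:
            return n_indexer if side == "indexer" else n_searcher

# e.g. the indexer calls  gen(text, M=4, N=pick_n(len(text), "indexer"))
# while the searcher calls gen(text, M=4, N=pick_n(len(text), "searcher")).
```

By Theorem 2 below, the indexer's smaller fingerprint set is a subset of what the searcher would generate from the same text, which is why a shared fingerprint remains likely.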
This method is referred to as asymmetric fingerprint generation, while Algorithm 1 as stated is symmetric fingerprint generation. Its ability to keep almost the same recall rate is supported by the following theoretical results.
Definition 8: Assume M is a constant. For any normalized document T, denote by S(T, N) the set of fingerprints extracted from T with the number N.
Theorem 2: Let T be any normalized document, and let n and m be two positive integers. If n < m, then S(T, n) ⊆ S(T, m), i.e., the set S(T, n) is a subset of S(T, m).
Proof: This follows directly from step 8 of Algorithm 1.
Theorem 3: Let D and d be two versions of the same normalized document, and let n and m be two positive integers. If n < m, we have
S(D, n) ∩ S(d, n) ⊆ S(D, n) ∩ S(d, m) ⊆ S(D, m) ∩ S(d, m).
Proof: Since n < m, we have S(d, n) ⊆ S(d, m) and S(D, n) ⊆ S(D, m) by Theorem 2. Therefore S(D, n) ∩ S(d, n) ⊆ S(D, n) ∩ S(d, m) and S(D, n) ∩ S(d, m) ⊆ S(D, m) ∩ S(d, m). Together these give S(D, n) ∩ S(d, n) ⊆ S(D, n) ∩ S(d, m) ⊆ S(D, m) ∩ S(d, m), which completes the proof.
Theorem 3 implies that the recall rate of asymmetric fingerprint generation lies between the two cases of symmetric fingerprint generation with the smaller and the larger number of fingerprints. As a matter of fact, the experimental data shows it is closer to the latter case while generating far fewer fingerprints at the indexer.
5. Experiments
In this section, we report a data experiment implemented with the asymmetric fingerprint generation architecture defined by the parameters of Table 2. Both indexer and searcher reside on a server running Windows Server 2003 with an Intel Xeon E5405 @ 2.0 GHz and 8 GB of RAM.
We prepared the experimental data sets as follows:
Normalized documents for indexing:
Corpus 1: this set consists of 1 million plain text files in UTF-8 encoding. Denote corpus 1 as S1.
Corpus 2: this set consists of 2115 plain text files in many different languages and with different file sizes. They are totally irrelevant to the files in S1. Denote corpus 2 as S2.
Let S = S1 ∪ S2. All files in S are registered for fingerprint generation and indexing.
Normalized documents for querying:
Corpus 3: this set consists of 6*6*2115 = 76140 files, made from S2 by applying 6 editing operations at 6 levels of change expressed as percentages. Corpus 3 is used for the querying experiment.
The 6 levels of change are 5%, 10%, 20%, 30%, 40% and 50%. For example, level 1 means we alter 5% of the content of an original file.
The 6 editing operations are ADD, ADH, ADE, DEL, CHG and MOV.
The 6 editing operations are defined as follows (a sketch of two of them appears after this list):
ADD: add a randomly generated block of characters at a random position in the file.
ADH: add a randomly generated block of characters at a random position in the file. Also add a randomly generated block of characters, with block size randomly selected between 50 and 100, at the beginning of the file.
ADE: add a randomly generated block of characters at a randomly selected position in the file. Also add a randomly generated block of characters, with block size randomly selected between 50 and 100, at the end of the file.
DEL: delete a block of characters from the file. The start point of the deletion is randomly selected.
CHG: replace a randomly selected block of characters in the file with a randomly generated block of characters.
MOV: move a randomly selected block of characters in the file to a random position in the file.
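The corpus generator is not given in this article; as an illustration, ADD and MOV might be implemented along these lines (the helper names are ours):

```python
import random
import string

def random_block(size):
    """A randomly generated block of characters of the given size."""
    return "".join(random.choices(string.ascii_letters, k=size))

def op_add(text, level):
    """ADD: insert a random block of level (e.g. 0.05 for 5%) of the
    text's length at a random position."""
    block = random_block(int(len(text) * level))
    pos = random.randrange(len(text) + 1)
    return text[:pos] + block + text[pos:]

def op_mov(text, level):
    """MOV: cut a randomly selected block of level of the text's length
    and reinsert it at a random position."""
    size = int(len(text) * level)
    start = random.randrange(len(text) - size + 1)
    block = text[start:start + size]
    rest = text[:start] + text[start + size:]
    pos = random.randrange(len(rest) + 1)
    return rest[:pos] + block + rest[pos:]
```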
Table 3: Querying time in seconds
Change Level    Total Files    Total Time (s)    Avg. per File (s)
5%              12690          1727              0.136
10%             12690          1776              0.139
20%             12690          1680              0.132
30%             12690          1709              0.134
40%             12690          1699              0.133
50%             12690          1649              0.129
Table 4: Numbers of files matched at each change level and operation
Change Level    ADD     ADH     ADE     DEL     CHG     MOV
5%              2080    2079    2082    2074    2071    2055
10%             2079    2069    2079    2073    2067    2055
20%             2045    2047    2055    2063    2029    2046
30%             2027    2019    2023    2058    1979    2041
40%             1993    2000    1998    2021    1924    2049
50%             1969    1977    1978    2020    1894    2049
Table 5: Total recall rate at each change level
Change Level    Matched Files    Recall Rate
5%              12441            98.03%
10%             12422            97.88%
20%             12285            96.80%
30%             12147            95.72%
40%             11985            94.44%
50%             11887            93.67%
Figure 1: Recall vs change level for different operations.
Experiment steps:
1. Fingerprint and index all files in S.
2. Set X% = 20%. Use each file from corpus 3 as a query document for the NDDD problem. Recall and precision are measured from the query results, and the querying speed is measured in seconds per file.
The experimental results are shown in Table 3, Table 4 and Figure 1.
Table 3 shows the performance of executing the search over 6*2115 = 12690 query files per change level, with the total time and the average time per file. For example, at change level 5% the total time is 1727 seconds, i.e., 0.136 seconds per file on average. This is quite fast considering that the set S contains more than 1 million fingerprinted documents.
Table 4 shows the number of successful queries for each change level and editing operation. For example, at change level 5% with the ADD operation, 2080 of the 2115 query files were matched, a recall of 98.3%. Figure 1 illustrates the recall rate versus change level for each operation.
Table 5 shows the overall recall rates for all change levels. As the document changes increase, the recall rate drops; the worst recall rate is 93.67% when the change is around 50%.
We should mention that there were no false positives across all 76140 query files. This is a natural outcome for the following reasons:
GEN and SIM are two string matching functions that are constructed independently.
Even if fingerprint matching produces false positives, the similarity threshold X% filters them out.
6. Conclusion
This article has examined and solved the problem of near-duplicate document detection. What we have studied can be summarized as follows:
A formal definition of the NDDD problem.
Text models discussed for effective representation; a language-independent text model is selected to represent the documents.
An NDDD model that refines the problem definition and decomposes the NDDD problem into three separate sub-problems that can be solved independently.
Algorithms to extract document fingerprints and to calculate document similarity.
An asymmetric fingerprint generation architecture that reduces the number of fingerprints for special applications.
A data experiment showing that our algorithmic solution has good performance, near-zero false positives and a high recall rate even when documents change by up to 50%.
The problem definition and algorithmic solution in this article have advantages over other approaches. The solution produces near-zero false positives, since the similarity calculation is independent of the fingerprint generation. The recall rate is high because the fingerprints are robust under moderate document changes. Finally, the solution is language independent: we can apply it to documents written in any language, and even to documents written in multiple languages.
7. REFERENCES
[1] Manber, U. 1994. Finding Similar Files in a Large File System. Proceedings of the USENIX Winter 1994 Technical Conference, San Francisco, California.
[2] Shivakumar, N. and Garcia-Molina, H. 1999. Finding near-replicas of documents on the web. Lecture Notes in Computer Science, Springer Berlin / Heidelberg, 1590, 204-212.
[3] Lopresti, D. P. 1999. Models and Algorithms for Duplicate Document Detection. Proceedings of the Fifth International Conference on Document Analysis and Recognition, Bangalore, India, 297-300, September 1999.
[4] Broder, A. Z. 2000. Identifying and Filtering Near-Duplicate Documents. Proceedings of the 11th Annual Symposium on Combinatorial Pattern Matching, UK. Springer-Verlag, pp. 1-10, 2000.
[5] Campbell, D. M., Chen, W. R. and Smith, R. D. 2000. Copy detection systems for digital documents. Proceedings of Advances in Digital Libraries, pp. 78-88, 2000.
[6] Ignatov, D. I. and Jánosi-Rancz, K. T. 2009. Towards a framework for near-duplicate detection in document collections based on closed sets of attributes. Acta Univ. Sapientiae, Informatica, 1, 2 (2009), 215-233.
[7] Kumar, J. P. and Govindarajulu, P. 2009. Duplicate and Near Duplicate Documents Detection: A Review. European Journal of Scientific Research, 32, 4 (2009), 514-527.
[8] Ren, L., Tan, D., Huang, F., Huang, S. and Dong, A. 2009. Matching engine with signature generation. US patent 7,516,130.
[9] Ren, L., Huang, S., Huang, F., Dong, A. and Tan, D. 2010. Matching engine for querying relevant documents. US patent 7,747,642.
[10] Ren, L., Huang, S., Huang, F. and Lin, Y. 2010. Document matching engine using asymmetric signature generation. US patent 7,860,853.