International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056
Volume: 06 Issue: 03 | Mar 2019 www.irjet.net p-ISSN: 2395-0072
© 2019, IRJET | Impact Factor value: 7.211 | ISO 9001:2008 Certified Journal | Page 8201
Deduplication detection for similarity in document analysis
via vector analysis
Mr. P. Sathiyanarayanan 1, Ms. P. Banushree 2, Ms. S. Subashree 3
1 Assistant Professor of CSE, 2,3 UG Scholar
Department of Computer Science and Engineering
Manakula Vinayagar Institute of Technology
Puducherry
---------------------------------------------------------------------***---------------------------------------------------------------------
Abstract - Similarity paraphrase analysis is a machine
learning approach in which the system investigates and
groups human opinions, feelings, etc., expressed as
text or speech about some topic. Nowadays, textual
data has a great impact on users. The
textual information may be in structured, unstructured,
or semi-structured form. To improve products,
brands, etc., the opinions of users are rated,
which leads to the storage of a huge amount of data. The
analysis of such large amounts of data is known as big data analytics.
This paper intends to survey the current
challenges in similarity analysis and its scope in the
field of real-time applications.
Keywords – Deduplication, paraphrase, big data,
analytics, data duplication.
1. INTRODUCTION
A word carries limited information compared with an
article, and the information carried by a sentence lies
between the two. Word-level semantics are easy to match,
but recall is poor because users often choose different
words to express the same meaning. Sentence-level
semantics carry a single topic together with its context.
Article-level semantics are complex, with multiple
topics and complicated structures. As a result,
information retrieval across these three levels is one
obstacle that impedes the development of natural language
understanding.
1.1 DATA MINING
Data mining is an interdisciplinary subfield of
computer science. It is the computational process of
discovering patterns in large data sets (“big data”)
involving methods at the intersection of artificial
intelligence, machine learning, statistics, and database
systems. The overall goal of the data mining process is
to extract information from a data set and transform it
into an understandable structure for further use. Aside
from the raw analysis step, it involves database and data
management aspects, data pre-processing, model and
inference considerations, interestingness metrics,
complexity considerations, post-processing of
discovered structures, visualization, and online
updating. Data mining is the analysis step of the
"knowledge discovery in databases" process, or KDD.
The actual data mining task is the automatic or semi-
automatic analysis of large quantities of data to extract
previously unknown, interesting patterns such as
groups of data records (cluster analysis), unusual
records (anomaly detection), and dependencies
(association rule mining). This usually involves using
database techniques such as spatial indices. These
patterns can then be seen as a kind of summary of the
input data, and may be used in further analysis or, for
example, in machine learning and predictive analytics.
For example, the data mining step might identify
multiple groups in the data, which can then be used to
obtain more accurate prediction results by a decision
support system. Neither the data collection, data
preparation, nor result interpretation and reporting is
part of the data mining step, but they do belong to the overall
KDD process as additional steps.
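As a small illustration of one of the pattern types named above, unusual records (anomaly detection), the following sketch flags values that lie far from the mean. The function name `unusual_records`, the z-score rule, the threshold of 2.0, and the sample sizes are all assumptions made for this example; the paper does not prescribe a particular anomaly detection method.

```python
from statistics import mean, pstdev
from typing import List


def unusual_records(values: List[float], z_threshold: float = 2.0) -> List[float]:
    # Flag values whose distance from the mean exceeds
    # z_threshold population standard deviations.
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values identical, nothing stands out
    return [v for v in values if abs(v - mu) / sigma > z_threshold]


# Made-up document sizes in kilobytes; one record is clearly different.
sizes_kb = [12, 14, 13, 15, 11, 13, 240]
print(unusual_records(sizes_kb))
```

A real data mining step would normally work on many attributes at once and would choose the threshold from the data rather than fixing it in advance.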
The related terms data dredging, data fishing,
and data snooping refer to the use of data mining
methods to sample parts of a larger population data set
that are (or may be) too small for reliable statistical
inferences to be made about the validity of any patterns
discovered. These methods can, however, be used in
creating new hypotheses to test against the larger data
populations.
Big Data concerns large-volume, complex, growing
data sets with multiple, autonomous sources. With the
fast development of networking, data storage, and
data collection capacity, Big Data is now rapidly
expanding in all science and engineering domains,
including the physical, biological, and biomedical sciences.
This paper presents a HACE theorem that characterizes
the features of the Big Data revolution, and proposes a
Big Data processing model, from the data mining
perspective.
This data-driven model involves demand-driven
aggregation of information sources, mining and analysis,
user interest modeling, and security and privacy
considerations. We analyze the challenging issues in the
data-driven model and also in the Big Data revolution.
1.2 BIG DATA
Big data is a collection of data sets so large and
complex that it becomes difficult to process using on-
hand database management tools. The challenges
include capture, curation, storage, search, sharing,
analysis, and visualization. The trend toward larger data
sets is due to the additional information derivable
from analysis of a single large set of related data, as
compared to separate smaller sets with the same total
amount of data, allowing correlations to be found to
"spot business trends, determine quality of research,
prevent diseases, link legal citations, combat crime,
and determine real-time roadway traffic conditions."
Put another way, big data is the realization of greater
business intelligence by storing, processing, and
analyzing data that was previously ignored due to the
limitations of traditional data management
technologies.
1.3 Some concepts
- NoSQL ("not only SQL"): databases that “move beyond”
relational data models (i.e., no tables, limited or no
use of SQL).
- Focus on retrieval of data and appending new data
(not necessarily tables).
- Focus on key value data stores that can be used to
locate data objects.
- Focus on supporting storage of large quantities of
unstructured data.
- SQL is not used for storage or retrieval of data.
- No ACID (atomicity, consistency, isolation,
durability).
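The key-value idea in the list above can be illustrated with a minimal in-memory sketch. The class `KeyValueStore` and its methods `append` and `get` are hypothetical names chosen for this example; they do not correspond to any particular NoSQL product, which would add persistence, distribution, and replication.

```python
from typing import Any, Dict, List


class KeyValueStore:
    """Toy in-memory key-value store (hypothetical, for illustration only)."""

    def __init__(self) -> None:
        self._data: Dict[str, List[Any]] = {}

    def append(self, key: str, value: Any) -> None:
        # New data is appended under the key; existing values are not rewritten.
        self._data.setdefault(key, []).append(value)

    def get(self, key: str) -> List[Any]:
        # Data objects are located by key only: no SQL and no table schema.
        return list(self._data.get(key, []))


store = KeyValueStore()
store.append("doc:1", {"title": "Deduplication", "words": 120})
store.append("doc:1", "free-form note")  # unstructured value under the same key
print(store.get("doc:1"))
```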
1.4 HADOOP
Hadoop is a distributed file system and data
processing engine designed to handle extremely
high volumes of data in any structure. Hadoop has two
components:
- The Hadoop Distributed File System (HDFS),
which supports data in structured relational
form, in unstructured form, and in any form in
between.
- The MapReduce programming paradigm for
managing applications on multiple distributed
servers.
The focus is on supporting redundancy,
distributed architectures, and parallel
processing.
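The MapReduce paradigm can be sketched on a single machine with a word count, the usual introductory example. This is plain Python that only imitates the map, shuffle, and reduce phases; it does not use the Hadoop API, and the function names are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def map_phase(document: str) -> List[Tuple[str, int]]:
    # Map: emit a (word, 1) pair for every word in one input split.
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    # Shuffle: group all emitted values by key.
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped: Dict[str, List[int]]) -> Dict[str, int]:
    # Reduce: combine the values of each key into a single count.
    return {key: sum(values) for key, values in grouped.items()}


def word_count(documents: List[str]) -> Dict[str, int]:
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return reduce_phase(shuffle(pairs))


counts = word_count(["big data big storage", "data mining"])
print(counts)
```

In a real cluster the map calls would run in parallel on different servers, each on its own block of the input, and the framework would perform the shuffle over the network.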
1.4.1 Some Hadoop Related Names to Know
• Apache Avro: designed for communication
between Hadoop nodes through data
serialization
• Cassandra and HBase: non-relational
databases designed for use with Hadoop
• Hive: provides a query language similar to SQL
(HiveQL) that is compatible with Hadoop
• Mahout: an AI tool designed for machine
learning; that is, to assist with filtering data for
analysis and exploration
• Pig: a data-flow language (Pig Latin) and execution
framework for parallel computation
• ZooKeeper: keeps all the parts coordinated
and working together
Figure 2: Processing layer (what to do with the data)
The Knowledge Discovery in Databases (KDD)
process is commonly defined with the stages:
(1) Selection
(2) Pre-processing
(3) Transformation
(4) Data Mining
(5) Interpretation/Evaluation.
Many variations on this theme exist, however,
such as the Cross Industry Standard Process for Data
Mining (CRISP-DM), which defines six phases:
(1) Business Understanding
(2) Data Understanding
(3) Data Preparation
(4) Modeling
(5) Evaluation
(6) Deployment
or a simplified process such as
(1) pre-processing,
(2) data mining, and
(3) results validation.
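The five KDD stages listed above can be read as a chain of functions, each consuming the output of the previous one. The sketch below applies them to a few made-up text records; the function names and the choice of "most frequent tokens" as the mining step are assumptions for illustration only.

```python
import re
from collections import Counter
from typing import List, Tuple


def select(records: List[str]) -> List[str]:
    # (1) Selection: keep only non-empty records.
    return [r for r in records if r.strip()]


def preprocess(records: List[str]) -> List[str]:
    # (2) Pre-processing: lowercase and replace punctuation with spaces.
    return [re.sub(r"[^a-z0-9\s]", " ", r.lower()) for r in records]


def transform(records: List[str]) -> List[List[str]]:
    # (3) Transformation: turn each record into a list of tokens.
    return [r.split() for r in records]


def mine(token_lists: List[List[str]], top_n: int = 2) -> List[Tuple[str, int]]:
    # (4) Data mining: find the most frequent tokens across all records.
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    return counts.most_common(top_n)


def interpret(patterns: List[Tuple[str, int]]) -> str:
    # (5) Interpretation/evaluation: report the patterns in readable form.
    return ", ".join(f"{word} ({count})" for word, count in patterns)


raw = ["Big data, big storage.", "", "Data mining of big data!"]
patterns = mine(transform(preprocess(select(raw))))
print(interpret(patterns))
```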
2. EXISTING SYSTEM
In the current system, word vectors and topic models
can help retrieve information semantically. To overcome
the above problems, this paper proposes a new vector
computation model for text named s2v. Words, sentences,
and paragraphs are represented in a unified way in the
model. Sentence vectors and paragraph vectors are trained
along with word vectors. Based on the unified
representation, word and sentence (of different lengths)
retrieval are studied experimentally. The results show that
information with similar meaning can be retrieved even if
the information is expressed with different words.
3. PROPOSED WORK
The similarity paraphrase analysis is done by
extracting the abstract content for comparing the documents.
A word carries limited information compared with an
article, and the information carried by a sentence lies
between the two. Word-level semantics are easy to match,
but recall is poor because users often choose different
words to express the same meaning. Sentence-level
semantics carry a single topic together with its context.
Article-level semantics are complex, with multiple
topics and complicated structures. As a result,
information retrieval across these three levels is one
obstacle that impedes the development of natural language
understanding.
The separated words are then rendered as an image
using a word cloud, and the frequency of the words
is shown as a bar graph. From this result, we
can determine whether or not the document is a
duplicate.
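The word frequency step can be sketched with the Python standard library alone. The helper names `word_frequencies` and `text_bar_chart` and the sample text are assumptions made for this example, and a plain-text bar chart stands in for the graphical bar graph and word cloud used in the project.

```python
import re
from collections import Counter
from typing import List


def word_frequencies(text: str) -> Counter:
    # Lowercase the text and count alphabetic tokens.
    return Counter(re.findall(r"[a-z]+", text.lower()))


def text_bar_chart(freqs: Counter, top_n: int = 5) -> List[str]:
    # One line per word: the word, one '#' per occurrence, then the count.
    return [f"{word:<16}{'#' * count} {count}"
            for word, count in freqs.most_common(top_n)]


abstract = "Data deduplication removes duplicate data. Duplicate data wastes storage."
freqs = word_frequencies(abstract)
for line in text_bar_chart(freqs, top_n=3):
    print(line)
```

Two documents whose charts are dominated by the same words with similar counts are candidates for a closer duplicate check.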
3.1 COMPLEXITY INVOLVED IN THE PROPOSAL
1) Antonyms share high similarity when clustered
through word vectors.
2) Vectors for named entities cannot be fully trained, as
named entities may appear only a limited number of times
in a specific corpus.
3) Words, sentences, and paragraphs that share the same
meaning but have no overlapping words
are hard to recognize.
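The third difficulty is easiest to see with purely count-based vectors, where it is at its most extreme. The sketch below uses bag-of-words counts and cosine similarity as a stand-in for trained vectors; the function names and example sentences are assumptions for illustration, not the model used in this paper.

```python
import math
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    # Count lowercased alphabetic tokens.
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    # Dot product over shared words divided by the product of vector lengths.
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


s1 = bag_of_words("The film was excellent")
s2 = bag_of_words("A superb movie")          # same meaning, no shared words
s3 = bag_of_words("The film was terrible")   # opposite meaning, shared words

print(cosine_similarity(s1, s2))
print(cosine_similarity(s1, s3))
```

The paraphrase pair scores lower than the pair with opposite meaning, because only literal word overlap is counted. Trained word vectors reduce this effect but, as the list above notes, do not remove it.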
BLOCK DIAGRAM:
In this block diagram, the system identifies
whether the given input is at the sentence level or the
paragraph level. It then divides the input into sentences
so that stemming and stop words can be identified.
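The sentence division and tokenization steps can be sketched with regular expressions. The splitting rule used here (a sentence ends at '.', '!' or '?' followed by whitespace) is a simplifying assumption that will mis-split abbreviations such as "e.g."; the function names are hypothetical.

```python
import re
from typing import List


def split_sentences(paragraph: str) -> List[str]:
    # Naive rule: a sentence ends at '.', '!' or '?' followed by whitespace.
    parts = re.split(r"(?<=[.!?])\s+", paragraph.strip())
    return [p for p in parts if p]


def tokenize(sentence: str) -> List[str]:
    # Lowercased alphabetic tokens; punctuation and digits are dropped.
    return re.findall(r"[a-z]+", sentence.lower())


paragraph = "Big data is growing. Is deduplication useful? It saves storage!"
sentences = split_sentences(paragraph)
tokens = [tokenize(s) for s in sentences]
print(sentences)
print(tokens)
```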
3.2 Sentence level
Since sentences are essentially made up of words,
it may be reasonable to argue that simply taking the
sum or the average of the constituent word vectors
should give a decent sentence representation.
This is akin to a bag-of-words representation, and hence
suffers from the same limitations, i.e.
- It ignores the order of words in the sentence.
- It ignores the sentence semantics completely.
Other word vector based approaches are similarly
constrained. For instance, a weighted average technique
again loses word order within the sentence. To remedy
this issue, Socher et al. combined the words in the order
given by the parse tree of the sentence. While this
technique may be suitable for complete sentences, it
does not work for phrases or paragraphs.
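The loss of word order under averaging can be shown with a tiny example. The two-dimensional vectors in `WORD_VECTORS` are invented for this sketch and are not trained embeddings; the function name is likewise an assumption.

```python
from typing import Dict, List

# Hypothetical 2-dimensional word vectors, invented for this example.
WORD_VECTORS: Dict[str, List[float]] = {
    "dog": [1.0, 0.0],
    "bites": [0.0, 1.0],
    "man": [1.0, 1.0],
}


def average_vector(sentence: str) -> List[float]:
    # Average the vectors of all known words; word order plays no role.
    vectors = [WORD_VECTORS[w] for w in sentence.lower().split() if w in WORD_VECTORS]
    if not vectors:
        return [0.0, 0.0]
    return [sum(component) / len(vectors) for component in zip(*vectors)]


a = average_vector("dog bites man")
b = average_vector("man bites dog")
print(a, b, a == b)
```

Both sentences receive the same vector although they describe different events, which is the first limitation listed above.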
3.3 Paragraph level
Paragraph Vectors has been recently
proposed as an unsupervised method for learning
distributed representations for pieces of texts. In their
work, the authors showed that the method can learn an
embedding of movie review texts which can be
leveraged for sentiment analysis. That proof of concept,
while encouraging, was rather narrow. Here we
consider tasks other than sentiment analysis, provide a
more thorough comparison of Paragraph Vectors to
other document modelling algorithms such as Latent
Dirichlet Allocation, and evaluate performance of the
method as we vary the dimensionality of the learned
representation.
We benchmarked the models on two
document similarity data sets, one from Wikipedia, one
from arXiv. We observe that the Paragraph Vector
method performs significantly better than other
methods, and propose a simple improvement to enhance
embedding quality. Somewhat surprisingly, we also
show that, much like word embeddings, vector
operations on Paragraph Vectors can produce useful
semantic results.
3.4 Stemming words
In linguistic morphology and information
retrieval, stemming is the process of reducing inflected
(or sometimes derived) words to their word stem, base
or root form—generally a written word form. The stem
need not be identical to the morphological root of the
word; it is usually sufficient that related words map to
the same stem, even if this stem is not in itself a valid
root. Algorithms for stemming have been studied
in computer science since the 1960s. Many search
engines treat words with the same stem as synonyms as
a kind of query expansion, a process called conflation.
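A minimal suffix-stripping sketch shows how related words map to the same stem. The suffix list and the minimum stem length are assumptions chosen for this example; this is a naive stemmer and not the Porter algorithm or any other published rule set.

```python
from typing import Dict, List

# Illustrative suffix list, checked in this order; not the Porter rules.
SUFFIXES = ("ions", "ion", "ing", "ed", "es", "s")


def naive_stem(word: str, min_stem: int = 3) -> str:
    # Strip the first matching suffix, provided enough of the word remains.
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


def conflate(words: List[str]) -> Dict[str, List[str]]:
    # Group words by stem, which is the conflation step described above.
    groups: Dict[str, List[str]] = {}
    for w in words:
        groups.setdefault(naive_stem(w), []).append(w)
    return groups


print(conflate(["connect", "connected", "connecting", "connections", "connects"]))
print(naive_stem("studies"))  # the stem need not be a valid word
```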
[Block diagram. Box labels: Given input; Stemming words; Stopping words; Sentence level; Paragraph level; Word net cloud activated; Frequent words (visual bag of words); Applying semantic similarity approach; Tokenize the word; Word order similarity taken; Division of sentences; Similarity weightage calculated.]
3.5 Stopping words
A stop word is a commonly used word (such
as “the”, “a”, “an”, “in”) that a search engine has been
programmed to ignore, both when indexing entries for
searching and when retrieving them as the result of a
search query.
We would not want these words taking up space in
our database or taking up valuable processing time.
They can be removed easily by storing a list of
words considered to be stop words. NLTK (Natural
Language Toolkit) in Python has lists of stop words
stored for 16 different languages. You can find them in
the nltk_data directory.
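Stop word removal amounts to filtering tokens against a stored list. The sketch below uses a small hand-written set, `STOP_WORDS`, which is an assumption made to keep the example self-contained; a real list is much longer.

```python
from typing import List

# Small hand-written stop list for illustration; a real list is much longer.
STOP_WORDS = {"the", "a", "an", "in", "of", "is", "and", "to"}


def remove_stop_words(tokens: List[str]) -> List[str]:
    # Keep only tokens that are not in the stop list (case-insensitive).
    return [t for t in tokens if t.lower() not in STOP_WORDS]


tokens = "The analysis of a document in the cloud".split()
print(remove_stop_words(tokens))
```

With NLTK installed and its stopwords corpus downloaded, `nltk.corpus.stopwords.words("english")` could be used in place of the hand-written set.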
4. CONCLUSION AND FUTURE WORK
In our project, a new vector
computation model was used. Words, sentences, and
paragraphs are represented in a unified way in the
model. Sentence vectors and paragraph vectors are
trained along with word vectors. The results show that
information with similar meaning can be retrieved even
if the information is expressed with different words.
Data deduplication technology identifies
redundant data quickly, which is useful in the corporate
and banking sectors. The textual information may be in
structured or semi-structured form. Whenever a user
uploads a file to the cloud, the system uses vector
analysis to check whether the file already exists.
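The upload check can be sketched as follows, assuming word-count vectors compared by cosine similarity and a cut-off of 0.9. The class `DedupIndex`, its `upload` method, and the threshold are hypothetical and chosen only for illustration; the vector model used in the project may differ.

```python
import math
import re
from collections import Counter
from typing import Dict, Optional


def vectorize(text: str) -> Counter:
    # Word-count vector over lowercased alphabetic tokens.
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values()) * sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class DedupIndex:
    """Hypothetical in-memory index of uploaded documents."""

    def __init__(self, threshold: float = 0.9) -> None:
        self.threshold = threshold  # assumed cut-off, not taken from the paper
        self.vectors: Dict[str, Counter] = {}

    def upload(self, name: str, text: str) -> Optional[str]:
        # Return the name of an existing near-duplicate, or store the new file.
        vec = vectorize(text)
        for stored_name, stored_vec in self.vectors.items():
            if cosine(vec, stored_vec) >= self.threshold:
                return stored_name
        self.vectors[name] = vec
        return None


index = DedupIndex()
print(index.upload("a.txt", "Data deduplication identifies redundant data quickly"))
print(index.upload("b.txt", "data deduplication identifies redundant data quickly."))
print(index.upload("c.txt", "Word clouds show frequent words"))
```

The linear scan over stored vectors is adequate only for small collections; a larger system would need an index structure to avoid comparing each upload against every stored file.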
REFERENCES
[1]. (2015) Ben, W. “Every Day Big Data Statistics –
2.5 quintillion bytes of data created daily",
Available-http://www.vcloudnews.com/every-day-big
data-statistics-2-5-quintillion-bytes-of-data-created-
daily/
[2] (2015) Jaspreet, S. "Understanding Data
Deduplication",
Available:http://www.druva.com/blog/understanding-
data-de-duplication/
[3] Y. Zhang and D. Feng and H. Jiang and W. Xia and
M.Fu and F. Huang and Y. Zhou. “a fast asymmetri
extremum cont-ent defined chunking algorithm for
data deduplication in backup storage systems”, IEEE
Transactions on Computers, pp. issue: 99, 1-1, 2016.
[4] A. Venish,K. Siva Sankar. “Study of Chunking
Algorithm in Data Deduplication,” Proceedings of the
International Conference on Soft Computing Systems,
Springer India vol. 2, pp 13-20, 2015.

 
A REVIEW ON MACHINE LEARNING IN ADAS
A REVIEW ON MACHINE LEARNING IN ADASA REVIEW ON MACHINE LEARNING IN ADAS
A REVIEW ON MACHINE LEARNING IN ADASIRJET Journal
 
Long Term Trend Analysis of Precipitation and Temperature for Asosa district,...
Long Term Trend Analysis of Precipitation and Temperature for Asosa district,...Long Term Trend Analysis of Precipitation and Temperature for Asosa district,...
Long Term Trend Analysis of Precipitation and Temperature for Asosa district,...IRJET Journal
 
P.E.B. Framed Structure Design and Analysis Using STAAD Pro
P.E.B. Framed Structure Design and Analysis Using STAAD ProP.E.B. Framed Structure Design and Analysis Using STAAD Pro
P.E.B. Framed Structure Design and Analysis Using STAAD ProIRJET Journal
 
A Review on Innovative Fiber Integration for Enhanced Reinforcement of Concre...
A Review on Innovative Fiber Integration for Enhanced Reinforcement of Concre...A Review on Innovative Fiber Integration for Enhanced Reinforcement of Concre...
A Review on Innovative Fiber Integration for Enhanced Reinforcement of Concre...IRJET Journal
 
Survey Paper on Cloud-Based Secured Healthcare System
Survey Paper on Cloud-Based Secured Healthcare SystemSurvey Paper on Cloud-Based Secured Healthcare System
Survey Paper on Cloud-Based Secured Healthcare SystemIRJET Journal
 
Review on studies and research on widening of existing concrete bridges
Review on studies and research on widening of existing concrete bridgesReview on studies and research on widening of existing concrete bridges
Review on studies and research on widening of existing concrete bridgesIRJET Journal
 
React based fullstack edtech web application
React based fullstack edtech web applicationReact based fullstack edtech web application
React based fullstack edtech web applicationIRJET Journal
 
A Comprehensive Review of Integrating IoT and Blockchain Technologies in the ...
A Comprehensive Review of Integrating IoT and Blockchain Technologies in the ...A Comprehensive Review of Integrating IoT and Blockchain Technologies in the ...
A Comprehensive Review of Integrating IoT and Blockchain Technologies in the ...IRJET Journal
 
A REVIEW ON THE PERFORMANCE OF COCONUT FIBRE REINFORCED CONCRETE.
A REVIEW ON THE PERFORMANCE OF COCONUT FIBRE REINFORCED CONCRETE.A REVIEW ON THE PERFORMANCE OF COCONUT FIBRE REINFORCED CONCRETE.
A REVIEW ON THE PERFORMANCE OF COCONUT FIBRE REINFORCED CONCRETE.IRJET Journal
 
Optimizing Business Management Process Workflows: The Dynamic Influence of Mi...
Optimizing Business Management Process Workflows: The Dynamic Influence of Mi...Optimizing Business Management Process Workflows: The Dynamic Influence of Mi...
Optimizing Business Management Process Workflows: The Dynamic Influence of Mi...IRJET Journal
 
Multistoried and Multi Bay Steel Building Frame by using Seismic Design
Multistoried and Multi Bay Steel Building Frame by using Seismic DesignMultistoried and Multi Bay Steel Building Frame by using Seismic Design
Multistoried and Multi Bay Steel Building Frame by using Seismic DesignIRJET Journal
 
Cost Optimization of Construction Using Plastic Waste as a Sustainable Constr...
Cost Optimization of Construction Using Plastic Waste as a Sustainable Constr...Cost Optimization of Construction Using Plastic Waste as a Sustainable Constr...
Cost Optimization of Construction Using Plastic Waste as a Sustainable Constr...IRJET Journal
 

More from IRJET Journal (20)

TUNNELING IN HIMALAYAS WITH NATM METHOD: A SPECIAL REFERENCES TO SUNGAL TUNNE...
TUNNELING IN HIMALAYAS WITH NATM METHOD: A SPECIAL REFERENCES TO SUNGAL TUNNE...TUNNELING IN HIMALAYAS WITH NATM METHOD: A SPECIAL REFERENCES TO SUNGAL TUNNE...
TUNNELING IN HIMALAYAS WITH NATM METHOD: A SPECIAL REFERENCES TO SUNGAL TUNNE...
 
STUDY THE EFFECT OF RESPONSE REDUCTION FACTOR ON RC FRAMED STRUCTURE
STUDY THE EFFECT OF RESPONSE REDUCTION FACTOR ON RC FRAMED STRUCTURESTUDY THE EFFECT OF RESPONSE REDUCTION FACTOR ON RC FRAMED STRUCTURE
STUDY THE EFFECT OF RESPONSE REDUCTION FACTOR ON RC FRAMED STRUCTURE
 
A COMPARATIVE ANALYSIS OF RCC ELEMENT OF SLAB WITH STARK STEEL (HYSD STEEL) A...
A COMPARATIVE ANALYSIS OF RCC ELEMENT OF SLAB WITH STARK STEEL (HYSD STEEL) A...A COMPARATIVE ANALYSIS OF RCC ELEMENT OF SLAB WITH STARK STEEL (HYSD STEEL) A...
A COMPARATIVE ANALYSIS OF RCC ELEMENT OF SLAB WITH STARK STEEL (HYSD STEEL) A...
 
Effect of Camber and Angles of Attack on Airfoil Characteristics
Effect of Camber and Angles of Attack on Airfoil CharacteristicsEffect of Camber and Angles of Attack on Airfoil Characteristics
Effect of Camber and Angles of Attack on Airfoil Characteristics
 
A Review on the Progress and Challenges of Aluminum-Based Metal Matrix Compos...
A Review on the Progress and Challenges of Aluminum-Based Metal Matrix Compos...A Review on the Progress and Challenges of Aluminum-Based Metal Matrix Compos...
A Review on the Progress and Challenges of Aluminum-Based Metal Matrix Compos...
 
Dynamic Urban Transit Optimization: A Graph Neural Network Approach for Real-...
Dynamic Urban Transit Optimization: A Graph Neural Network Approach for Real-...Dynamic Urban Transit Optimization: A Graph Neural Network Approach for Real-...
Dynamic Urban Transit Optimization: A Graph Neural Network Approach for Real-...
 
Structural Analysis and Design of Multi-Storey Symmetric and Asymmetric Shape...
Structural Analysis and Design of Multi-Storey Symmetric and Asymmetric Shape...Structural Analysis and Design of Multi-Storey Symmetric and Asymmetric Shape...
Structural Analysis and Design of Multi-Storey Symmetric and Asymmetric Shape...
 
A Review of “Seismic Response of RC Structures Having Plan and Vertical Irreg...
A Review of “Seismic Response of RC Structures Having Plan and Vertical Irreg...A Review of “Seismic Response of RC Structures Having Plan and Vertical Irreg...
A Review of “Seismic Response of RC Structures Having Plan and Vertical Irreg...
 
A REVIEW ON MACHINE LEARNING IN ADAS
A REVIEW ON MACHINE LEARNING IN ADASA REVIEW ON MACHINE LEARNING IN ADAS
A REVIEW ON MACHINE LEARNING IN ADAS
 
Long Term Trend Analysis of Precipitation and Temperature for Asosa district,...
Long Term Trend Analysis of Precipitation and Temperature for Asosa district,...Long Term Trend Analysis of Precipitation and Temperature for Asosa district,...
Long Term Trend Analysis of Precipitation and Temperature for Asosa district,...
 
P.E.B. Framed Structure Design and Analysis Using STAAD Pro
P.E.B. Framed Structure Design and Analysis Using STAAD ProP.E.B. Framed Structure Design and Analysis Using STAAD Pro
P.E.B. Framed Structure Design and Analysis Using STAAD Pro
 
A Review on Innovative Fiber Integration for Enhanced Reinforcement of Concre...
A Review on Innovative Fiber Integration for Enhanced Reinforcement of Concre...A Review on Innovative Fiber Integration for Enhanced Reinforcement of Concre...
A Review on Innovative Fiber Integration for Enhanced Reinforcement of Concre...
 
Survey Paper on Cloud-Based Secured Healthcare System
Survey Paper on Cloud-Based Secured Healthcare SystemSurvey Paper on Cloud-Based Secured Healthcare System
Survey Paper on Cloud-Based Secured Healthcare System
 
Review on studies and research on widening of existing concrete bridges
Review on studies and research on widening of existing concrete bridgesReview on studies and research on widening of existing concrete bridges
Review on studies and research on widening of existing concrete bridges
 
React based fullstack edtech web application
React based fullstack edtech web applicationReact based fullstack edtech web application
React based fullstack edtech web application
 
A Comprehensive Review of Integrating IoT and Blockchain Technologies in the ...
A Comprehensive Review of Integrating IoT and Blockchain Technologies in the ...A Comprehensive Review of Integrating IoT and Blockchain Technologies in the ...
A Comprehensive Review of Integrating IoT and Blockchain Technologies in the ...
 
A REVIEW ON THE PERFORMANCE OF COCONUT FIBRE REINFORCED CONCRETE.
A REVIEW ON THE PERFORMANCE OF COCONUT FIBRE REINFORCED CONCRETE.A REVIEW ON THE PERFORMANCE OF COCONUT FIBRE REINFORCED CONCRETE.
A REVIEW ON THE PERFORMANCE OF COCONUT FIBRE REINFORCED CONCRETE.
 
Optimizing Business Management Process Workflows: The Dynamic Influence of Mi...
Optimizing Business Management Process Workflows: The Dynamic Influence of Mi...Optimizing Business Management Process Workflows: The Dynamic Influence of Mi...
Optimizing Business Management Process Workflows: The Dynamic Influence of Mi...
 
Multistoried and Multi Bay Steel Building Frame by using Seismic Design
Multistoried and Multi Bay Steel Building Frame by using Seismic DesignMultistoried and Multi Bay Steel Building Frame by using Seismic Design
Multistoried and Multi Bay Steel Building Frame by using Seismic Design
 
Cost Optimization of Construction Using Plastic Waste as a Sustainable Constr...
Cost Optimization of Construction Using Plastic Waste as a Sustainable Constr...Cost Optimization of Construction Using Plastic Waste as a Sustainable Constr...
Cost Optimization of Construction Using Plastic Waste as a Sustainable Constr...
 

Recently uploaded

power system scada applications and uses
power system scada applications and usespower system scada applications and uses
power system scada applications and usesDevarapalliHaritha
 
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝soniya singh
 
HARMONY IN THE HUMAN BEING - Unit-II UHV-2
HARMONY IN THE HUMAN BEING - Unit-II UHV-2HARMONY IN THE HUMAN BEING - Unit-II UHV-2
HARMONY IN THE HUMAN BEING - Unit-II UHV-2RajaP95
 
Past, Present and Future of Generative AI
Past, Present and Future of Generative AIPast, Present and Future of Generative AI
Past, Present and Future of Generative AIabhishek36461
 
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...srsj9000
 
Oxy acetylene welding presentation note.
Oxy acetylene welding presentation note.Oxy acetylene welding presentation note.
Oxy acetylene welding presentation note.eptoze12
 
What are the advantages and disadvantages of membrane structures.pptx
What are the advantages and disadvantages of membrane structures.pptxWhat are the advantages and disadvantages of membrane structures.pptx
What are the advantages and disadvantages of membrane structures.pptxwendy cai
 
Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024hassan khalil
 
GDSC ASEB Gen AI study jams presentation
GDSC ASEB Gen AI study jams presentationGDSC ASEB Gen AI study jams presentation
GDSC ASEB Gen AI study jams presentationGDSCAESB
 
Biology for Computer Engineers Course Handout.pptx
Biology for Computer Engineers Course Handout.pptxBiology for Computer Engineers Course Handout.pptx
Biology for Computer Engineers Course Handout.pptxDeepakSakkari2
 
VICTOR MAESTRE RAMIREZ - Planetary Defender on NASA's Double Asteroid Redirec...
VICTOR MAESTRE RAMIREZ - Planetary Defender on NASA's Double Asteroid Redirec...VICTOR MAESTRE RAMIREZ - Planetary Defender on NASA's Double Asteroid Redirec...
VICTOR MAESTRE RAMIREZ - Planetary Defender on NASA's Double Asteroid Redirec...VICTOR MAESTRE RAMIREZ
 
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdfCCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdfAsst.prof M.Gokilavani
 
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptxDecoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptxJoão Esperancinha
 
Software and Systems Engineering Standards: Verification and Validation of Sy...
Software and Systems Engineering Standards: Verification and Validation of Sy...Software and Systems Engineering Standards: Verification and Validation of Sy...
Software and Systems Engineering Standards: Verification and Validation of Sy...VICTOR MAESTRE RAMIREZ
 
microprocessor 8085 and its interfacing
microprocessor 8085  and its interfacingmicroprocessor 8085  and its interfacing
microprocessor 8085 and its interfacingjaychoudhary37
 
Application of Residue Theorem to evaluate real integrations.pptx
Application of Residue Theorem to evaluate real integrations.pptxApplication of Residue Theorem to evaluate real integrations.pptx
Application of Residue Theorem to evaluate real integrations.pptx959SahilShah
 
College Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service NashikCollege Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service NashikCall Girls in Nagpur High Profile
 
Artificial-Intelligence-in-Electronics (K).pptx
Artificial-Intelligence-in-Electronics (K).pptxArtificial-Intelligence-in-Electronics (K).pptx
Artificial-Intelligence-in-Electronics (K).pptxbritheesh05
 

Recently uploaded (20)

power system scada applications and uses
power system scada applications and usespower system scada applications and uses
power system scada applications and uses
 
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝
 
HARMONY IN THE HUMAN BEING - Unit-II UHV-2
HARMONY IN THE HUMAN BEING - Unit-II UHV-2HARMONY IN THE HUMAN BEING - Unit-II UHV-2
HARMONY IN THE HUMAN BEING - Unit-II UHV-2
 
Past, Present and Future of Generative AI
Past, Present and Future of Generative AIPast, Present and Future of Generative AI
Past, Present and Future of Generative AI
 
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...
 
Oxy acetylene welding presentation note.
Oxy acetylene welding presentation note.Oxy acetylene welding presentation note.
Oxy acetylene welding presentation note.
 
What are the advantages and disadvantages of membrane structures.pptx
What are the advantages and disadvantages of membrane structures.pptxWhat are the advantages and disadvantages of membrane structures.pptx
What are the advantages and disadvantages of membrane structures.pptx
 
Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024
 
GDSC ASEB Gen AI study jams presentation
GDSC ASEB Gen AI study jams presentationGDSC ASEB Gen AI study jams presentation
GDSC ASEB Gen AI study jams presentation
 
Call Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCR
Call Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCRCall Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCR
Call Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCR
 
Biology for Computer Engineers Course Handout.pptx
Biology for Computer Engineers Course Handout.pptxBiology for Computer Engineers Course Handout.pptx
Biology for Computer Engineers Course Handout.pptx
 
9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf
9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf
9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf
 
VICTOR MAESTRE RAMIREZ - Planetary Defender on NASA's Double Asteroid Redirec...
VICTOR MAESTRE RAMIREZ - Planetary Defender on NASA's Double Asteroid Redirec...VICTOR MAESTRE RAMIREZ - Planetary Defender on NASA's Double Asteroid Redirec...
VICTOR MAESTRE RAMIREZ - Planetary Defender on NASA's Double Asteroid Redirec...
 
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdfCCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
 
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptxDecoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
 
Software and Systems Engineering Standards: Verification and Validation of Sy...
Software and Systems Engineering Standards: Verification and Validation of Sy...Software and Systems Engineering Standards: Verification and Validation of Sy...
Software and Systems Engineering Standards: Verification and Validation of Sy...
 
microprocessor 8085 and its interfacing
microprocessor 8085  and its interfacingmicroprocessor 8085  and its interfacing
microprocessor 8085 and its interfacing
 
Application of Residue Theorem to evaluate real integrations.pptx
Application of Residue Theorem to evaluate real integrations.pptxApplication of Residue Theorem to evaluate real integrations.pptx
Application of Residue Theorem to evaluate real integrations.pptx
 
College Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service NashikCollege Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
 
Artificial-Intelligence-in-Electronics (K).pptx
Artificial-Intelligence-in-Electronics (K).pptxArtificial-Intelligence-in-Electronics (K).pptx
Artificial-Intelligence-in-Electronics (K).pptx
 

IRJET- Deduplication Detection for Similarity in Document Analysis Via Vector Analysis

  • 1. International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056 Volume: 06 Issue: 03 | Mar 2019 www.irjet.net p-ISSN: 2395-0072 © 2019, IRJET | Impact Factor value: 7.211 | ISO 9001:2008 Certified Journal | Page 8201
Deduplication detection for similarity in document analysis via vector analysis
Mr. P. Sathiyanarayanan 1, Ms. P. Banushree 2, Ms. S. Subashree 3
1 Assistant Professor of CSE, 2,3 UG Scholar
Department of Computer Science and Engineering, Manakula Vinayagar Institute of Technology, Puducherry
Abstract - Similarity paraphrase analysis is a machine learning approach in which the system investigates and groups people's opinions, feelings, etc., expressed as text or speech about some topic. Nowadays, textual data has a great impact on users. Textual information may be in structured, unstructured, or semi-structured form. In order to improve products, brands, etc., users' opinions are rated, which leads to the storage of a huge amount of data. The analysis of large amounts of data is known as big data analytics. This paper surveys the current challenges in similarity analysis and its scope in real-time applications.
Keywords: Deduplicate, paraphrase, Bigdata, analytics, data duplication.
1. INTRODUCTION
Word information is limited when compared with article information. The information carried by a sentence lies between that of a word and that of an article. Word-level semantics can be matched easily but are hard to recall, because users often use different words to express the same meaning. Sentence-level semantics carry a single topic with its context. Article-level semantics are complex, with multiple topics and complicated structures.
As a result, information retrieval across these three levels is one obstacle that impedes the development of natural language understanding.
1.1 DATA MINING
Data mining is an interdisciplinary subfield of computer science. It is the computational process of discovering patterns in large data sets ("big data") involving methods at the intersection of artificial intelligence, machine learning, statistics, and database systems. The overall goal of the data mining process is to extract information from a data set and transform it into an understandable structure for further use. Aside from the raw analysis step, it involves database and data management aspects, data pre-processing, model and inference considerations, interestingness metrics, complexity considerations, post-processing of discovered structures, visualization, and online updating. Data mining is the analysis step of the "knowledge discovery in databases" process, or KDD.
The actual data mining task is the automatic or semi-automatic analysis of large quantities of data to extract previously unknown, interesting patterns such as groups of data records (cluster analysis), unusual records (anomaly detection), and dependencies (association rule mining). This usually involves using database techniques such as spatial indices. These patterns can then be seen as a kind of summary of the input data, and may be used in further analysis or, for example, in machine learning and predictive analytics. For example, the data mining step might identify multiple groups in the data, which can then be used to obtain more accurate prediction results by a decision support system. Data collection, data preparation, and result interpretation and reporting are not part of the data mining step, but they do belong to the overall KDD process as additional steps.
The related terms data dredging, data fishing, and data snooping refer to the use of data mining methods to sample parts of a larger population data set that are (or may be) too small for reliable statistical inferences to be made about the validity of any patterns discovered. These methods can, however, be used in creating new hypotheses to test against the larger data populations.
Big Data concerns large-volume, complex, growing data sets with multiple, autonomous sources. With the fast development of networking, data storage, and data collection capacity, Big Data is now rapidly expanding in all science and engineering domains, including the physical, biological, and biomedical sciences. This paper presents a HACE theorem that characterizes the features of the Big Data revolution, and proposes a Big Data processing model from the data mining perspective.
  • 2. This data-driven model involves demand-driven aggregation of information sources, mining and analysis, user interest modeling, and security and privacy considerations. We analyze the challenging issues in the data-driven model and also in the Big Data revolution.
1.2 BIG DATA
Big data is a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools. The challenges include capture, curation, storage, search, sharing, analysis, and visualization. The trend toward larger data sets is due to the additional information derivable from analysis of a single large set of related data, as compared to separate smaller sets with the same total amount of data, allowing correlations to be found to "spot business trends, determine quality of research, prevent diseases, link legal citations, combat crime, and determine real-time roadway traffic conditions." Put another way, big data is the realization of greater business intelligence by storing, processing, and analyzing data that was previously ignored due to the limitations of traditional data management technologies.
1.3 Some concepts
- NoSQL ("not only SQL"): databases that "move beyond" relational data models (i.e., no tables, limited or no use of SQL).
- Focus on retrieval of data and appending new data (not necessarily tables).
- Focus on key-value data stores that can be used to locate data objects.
- Focus on supporting storage of large quantities of unstructured data.
- SQL is not used for storage or retrieval of data.
- No ACID (atomicity, consistency, isolation, durability).
1.4 HADOOP
Hadoop is a distributed file system and data processing engine that is designed to handle extremely high volumes of data in any structure.
Hadoop has two components:
- The Hadoop Distributed File System (HDFS), which supports data in structured relational form, in unstructured form, and in any form in between.
- The MapReduce programming paradigm for managing applications on multiple distributed servers.
The focus is on supporting redundancy, distributed architectures, and parallel processing.
1.4.1 Some Hadoop Related Names to Know
• Apache Avro: designed for communication between Hadoop nodes through data serialization
• Cassandra and HBase: non-relational databases designed for use with Hadoop
• Hive: a query language similar to SQL (HiveQL) but compatible with Hadoop
• Mahout: an AI tool designed for machine learning; that is, to assist with filtering data for analysis and exploration
• Pig Latin: a data-flow language and execution framework for parallel computation
• ZooKeeper: keeps all the parts coordinated and working together
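The MapReduce paradigm named in Section 1.4 can be illustrated with a single-process word count. This is only a minimal sketch of the map, shuffle, and reduce steps in plain Python; it does not use Hadoop, and the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical names chosen for illustration.

```python
from collections import defaultdict


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce step: sum the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(documents):
    """Run map, shuffle, and reduce sequentially over a list of documents."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return reduce_phase(shuffle(pairs))
```

In a real cluster the map and reduce steps would run in parallel on different nodes; here they run one after another so the data flow is easy to follow.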
  • 3. Figure 2: Processing layer (what to do with the data)
The Knowledge Discovery in Databases (KDD) process is commonly defined with the stages: (1) Selection (2) Pre-processing (3) Transformation (4) Data Mining (5) Interpretation/Evaluation. Many variations on this theme exist, however, such as the Cross Industry Standard Process for Data Mining (CRISP-DM), which defines six phases: (1) Business Understanding (2) Data Understanding (3) Data Preparation (4) Modeling (5) Evaluation (6) Deployment; or a simplified process such as (1) pre-processing, (2) data mining, and (3) results validation.
2. EXISTING SYSTEM
In the current system, word vectors and topic models can help retrieve information semantically. To overcome the above problems, this paper proposes a new vector computation model for text named s2v. Words, sentences, and paragraphs are represented in a unified way in the model. Sentence vectors and paragraph vectors are trained along with word vectors. Based on the unified representation, word and sentence (with different lengths) retrieval are studied experimentally. The results show that information with similar meaning can be retrieved even if the information is expressed with different words.
3. PROPOSED WORK
The similarity paraphrase analysis is done by extracting the abstract content in order to compare documents. Word information is limited when compared with article information. The information carried by a sentence lies between that of a word and that of an article. Word-level semantics can be matched easily but are hard to recall, because users often use different words to express the same meaning. Sentence-level semantics carry a single topic with its context.
Article-level semantics are complex, with multiple topics and complicated structures. As a result, information retrieval across these three levels is one obstacle that impedes the development of natural language understanding. The separated words are then combined into an image using a word cloud, and the word frequencies are shown as a bar graph. From this result, we can determine whether or not the document is a duplicate.
3.1 COMPLEXITY INVOLVED IN THE PROPOSAL
1) Antonyms share high similarity when clustered through word vectors.
2) Vectors for named entities cannot be fully trained, as named entities may appear only a limited number of times in a specific corpus.
  • 4. 3) Words, sentences, and paragraphs that share the same meaning but have no overlapping words are hard to recognize.
BLOCK DIAGRAM:
The block diagram identifies whether the given input is at the sentence level or the paragraph level. It then divides the input into sentences so that stemming and stop-word removal can be applied.
3.2 Sentence level
Since sentences are essentially made up of words, it may be reasonable to argue that simply taking the sum or the average of the constituent word vectors should give a decent sentence representation. This is akin to a bag-of-words representation, and hence suffers from the same limitations, i.e.:
- It ignores the order of words in the sentence.
- It ignores the sentence semantics completely.
Other word-vector-based approaches are similarly constrained. For instance, a weighted average technique again loses word order within the sentence. To remedy this issue, Socher et al. combined the words in the order given by the parse tree of the sentence. While this technique may be suitable for complete sentences, it does not work for phrases or paragraphs.
3.3 Paragraph level
Paragraph Vectors has recently been proposed as an unsupervised method for learning distributed representations for pieces of text. In their work, the authors showed that the method can learn an embedding of movie review texts which can be leveraged for sentiment analysis. That proof of concept, while encouraging, was rather narrow. Here we consider tasks other than sentiment analysis, provide a more thorough comparison of Paragraph Vectors to other document modelling algorithms such as Latent Dirichlet Allocation, and evaluate performance of the method as we vary the dimensionality of the learned representation.
We benchmarked the models on two document similarity data sets, one from Wikipedia and one from arXiv. We observe that the Paragraph Vector method performs significantly better than other methods, and propose a simple improvement to enhance embedding quality. Somewhat surprisingly, we also show that, much like word embeddings, vector operations on Paragraph Vectors can produce useful semantic results.
3.4 Stemming words
In linguistic morphology and information retrieval, stemming is the process of reducing inflected (or sometimes derived) words to their word stem, base or root form, generally a written word form. The stem need not be identical to the morphological root of the word; it is usually sufficient that related words map to the same stem, even if this stem is not in itself a valid root. Algorithms for stemming have been studied in computer science since the 1960s. Many search engines treat words with the same stem as synonyms, as a kind of query expansion, a process called conflation.
[Block diagram labels: Given input; Stemming words; Stopping words; Sentence level; Paragraph level; Word net cloud activated; Frequent words (visual bag of words); Applying semantic similarity approach; Tokenize the word; Word order similarity taken; Division of sentences; Similarity weightage calculated]
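The stemming and conflation idea described in Section 3.4 can be illustrated with a very small sketch. The code below is not the Porter algorithm and is not taken from the paper's implementation; it is a simplified suffix-stripping stemmer written only for illustration. The names `SUFFIXES`, `simple_stem`, `conflate` and the `min_stem_len` parameter are hypothetical.

```python
# Simplified suffix-stripping stemmer: an illustrative assumption,
# NOT the Porter algorithm and NOT the authors' implementation.
SUFFIXES = ("ingly", "edly", "ing", "ed", "ly", "es", "s")  # longer suffixes are tried first


def simple_stem(word: str, min_stem_len: int = 3) -> str:
    """Strip the first matching suffix, keeping at least min_stem_len characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word  # no suffix matched; the word is its own stem


def conflate(words):
    """Group words that share a stem, mimicking conflation in query expansion."""
    groups = {}
    for w in words:
        groups.setdefault(simple_stem(w), []).append(w)
    return groups


if __name__ == "__main__":
    sample = ["connect", "connected", "connecting", "connects",
              "duplicated", "duplicates"]
    for stem, members in conflate(sample).items():
        print(stem, "<-", members)
```

In this sketch "duplicated" and "duplicates" both map to "duplicat", which is not a dictionary word; this matches the point above that a stem only needs to be shared by related words. In practice an established stemmer, such as the Porter stemmer shipped with NLTK, would normally be preferred over a hand-written suffix list.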
3.5 Stopping words
A stop word is a commonly used word (such as “the”, “a”, “an”, “in”) that a search engine has been programmed to ignore, both when indexing entries for searching and when retrieving them as the result of a search query. We would not want these words taking up space in our database or consuming valuable processing time. They can be removed easily by storing a list of the words considered to be stop words. NLTK (Natural Language Toolkit) in Python has lists of stop words stored for 16 different languages. You can find them in the nltk_data directory.
4. CONCLUSION AND FUTURE WORK
In our project, a new vector computation model was used. Words, sentences, and paragraphs are represented in a unified way in the model. Sentence vectors and paragraph vectors are trained along with word vectors. The results show that information with similar meaning can be retrieved even if it is expressed with different words. Data deduplication technology identifies redundant data quickly and can be used in the corporate or banking sector. The textual information may be in structured or semi-structured form. Whenever a user uploads a file to the cloud, the system checks whether the file already exists by using vector analysis.
REFERENCES
[1] (2015) Ben, W. “Every Day Big Data Statistics – 2.5 quintillion bytes of data created daily”, Available: http://www.vcloudnews.com/every-day-big-data-statistics-2-5-quintillion-bytes-of-data-created-daily/
[2] (2015) Jaspreet, S. “Understanding Data Deduplication”, Available: http://www.druva.com/blog/understanding-data-de-duplication/
[3] Y. Zhang, D. Feng, H. Jiang, W. Xia, M. Fu, F. Huang and Y. Zhou, “A fast asymmetric extremum content defined chunking algorithm for data deduplication in backup storage systems”, IEEE Transactions on Computers, issue 99, pp. 1-1, 2016.
[4] A. Venish and K. Siva Sankar, “Study of Chunking Algorithm in Data Deduplication”, Proceedings of the International Conference on Soft Computing Systems, Springer India, vol. 2, pp. 13-20, 2015.
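As a rough illustration of the stop-word filtering in Section 3.5 and the upload-time duplicate check mentioned in the conclusion, the sketch below compares documents by the cosine similarity of their term-frequency vectors. This is a swapped-in simplification: the paper describes trained word, sentence and paragraph vectors, whereas this example uses plain word counts. The stop-word list, the function names and the 0.9 threshold are all assumptions made for the example.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; NLTK ships much fuller lists).
STOP_WORDS = {"the", "a", "an", "in", "is", "of", "to", "and"}


def term_vector(text: str) -> Counter:
    """Lower-case the text, tokenize on letters, drop stop words, count the rest."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if t not in STOP_WORDS)


def cosine_similarity(u: Counter, v: Counter) -> float:
    """Cosine of the angle between two sparse term-frequency vectors."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_u * norm_v)


def is_duplicate(new_text: str, stored_texts, threshold: float = 0.9) -> bool:
    """Return True if new_text is close enough to any already stored text."""
    new_vec = term_vector(new_text)
    return any(
        cosine_similarity(new_vec, term_vector(s)) >= threshold
        for s in stored_texts
    )


if __name__ == "__main__":
    stored = ["The cat sat in the hat"]
    print(is_duplicate("A cat sat in a hat", stored))
    print(is_duplicate("Stock prices fell sharply", stored))
```

Because it relies on shared words, a count-based check like this cannot recognize paraphrases with no overlapping words, which is the limitation listed as item 3 in Section 3.1 and the motivation for using trained vectors instead.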