Copyright © IJIFR 2014 Author’s Subject Area: Computer Science and Engg
Available Online at: - http://www.ijifr.com/searchjournal.aspx
www.ijifr.com ijifr.journal@gmail.com ISSN (Online): 2347-1697
INTERNATIONAL JOURNAL OF INFORMATIVE & FUTURISTIC RESEARCH
An Enlightening Online Open Access, Refereed & Indexed Journal of Multidisciplinary Research
Volume -1 Issue -9, May 2014
UDD for Multiple Web Database
Abstract
UDD is an algorithm that detects duplicates in unsupervised data from Web databases, and it is
effective for record matching. Some Web databases use supervised methods, which are complex
to apply to large Web databases and are time consuming. UDD addresses this problem and is
also suitable for dynamically generated data on the Web. When a user submits a query, data is
gathered from multiple Web databases and filtered by UDD to detect duplicate results, so that
a unique result set is shown. For this we use the WCSS classifier, which performs string
comparisons to identify duplicates. This paper shows that UDD works well for multiple Web
databases, where supervised methods are not practical.
Keywords - Record Matching, WCSS classifier, Multiple Web Database, Duplicate Detection, DD
Algorithm
1 Introduction
Nowadays, the Web database is a main part of a Web service and stores a huge amount of data.
A Web database holds various categories of data such as text, images and documents, and it is
used to generate Web pages dynamically. End users have no control over such pages. When a
user sends a query to a Web database, it returns data related to the query, but the user may get
repeated data from multiple Web databases (e.g., the same book from multiple Web databases).
Web databases therefore contain a large amount of redundant data. The main task is to detect
duplicate entries by examining and exactly matching fields from various data sources. In this
paper we focus on removing duplicate data returned by Web databases. This can be done by
comparing the results returned from the databases; comparing data from different sources is
the most important step. Figures 1 and 2 show the results returned from two different online
bookstores, amazon.com and books.rediff.com, in response to the same query "asp.net" over
the title field. It can be seen that result 3 in Figure 1 and result 1 in Figure 2 refer to the
same book, since they have the same title although their prices differ.
To overcome this problem, the Unsupervised Duplicate Detection (UDD) algorithm is used.
1 Bhadakwan Shrikant, 2 Jagadale Prashant, 3 Bhanarkar Prasad, 4 Sandeep Bhatia
1, 2, 3, 4 Shardchandra Pawar College of Engineering, Pune University, Pune
PAPERID:IJIFR/V1/E9/001
Bhadakwan Shrikant, Jagadale Prashant, Bhanarkar Prasad, Sandeep Bhatia: UDD for Multiple Web Database
www.ijifr.com Email: ijifr.journal@gmail.com © IJIFR 2014
This paper is available online at - http://www.ijifr.com/searchjournal.aspx
PAPER ID: IJIFR/V1/E9/001
ISSN (Online): 2347-1697
INTERNATIONAL JOURNAL OF INFORMATIVE & FUTURISTIC RESEARCH
Volume -1 Issue -9, May 2014
Author’s Research Area: Computer Science and Engg, Page No.:01-07
UDD is a record matching technique that detects duplicates among records from different Web
databases. The primary aim of this paper is the detection of duplicates, such as two or more
books having the same name, ISBN or author. Duplicates from multiple Web databases are
identified using WCSS.
In the WCSS classifier, a weight is calculated between records; two or more records having the
same weight are considered duplicates, and a vector table is used to sort the records into a
duplicate data set and a non-duplicate data set. This can be used only for multiple Web
databases of a single domain and provides a limited result.
UDD focuses on methods for adjusting the weights of the record fields when computing the
similarity between two records. Two entries are considered duplicates when they are exactly
similar on their fields. Figures 1 and 2 show the same book name, Asp.net MVC 4, with the
same author name. Methods that deal with duplicate identification can be broadly classified
into those that require training data and those that can function without predefined training
data. In this paper we explore unsupervised methods and look for a method that can work with
minimal supervision. The search engine example is used to highlight the need for an algorithm
that can handle huge amounts of data and find a unique result set that is more relevant to the
user query.
Figure.1: Query Result from Amazon.com
Figure.2: Query Result From books.rediff.com
By comparing the figures above, we can find the duplicate results. We avoid this type of duplication
by using the Unsupervised Duplicate Detection algorithm.
2 Related Work
2.1 Record Matching
Record matching identifies duplicate records in the response to a user query, that is, records that
represent the same real-world entity.
1. Problem Definition:
We separate duplicate and non-duplicate results using the WCSS classifier, and the
unsupervised duplicate detection algorithm then produces the unique result set.
2. Element Detection:
Using record matching, we compare the user query with all the records in the site 1 and
site 2 Web databases, here Amazon.com and books.rediff.com respectively, as shown in
Table 1.
Table 1. Exact matching across the two Web databases

Book Name                            Author           Price    Database
Professional Asp.net MVC 4           Jon Galloway     399      Amazon.com
Beginning ASP.NET 4.5 in C# and VB   Imar Spaanjaars  483.90   Amazon.com
ASP.NET 4 Unleashed                  Stephen          680      books.rediff.com
Professional Asp.net MVC 4           Jon Galloway     520      books.rediff.com
Here we perform exact matching, as shown in the table above. Unsupervised Duplicate Elimination
(UDE) then removes the duplicates.
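The exact-matching step over Table 1 can be sketched as follows. This is a minimal illustration, not the authors' implementation: the tuple layout and the helper names `normalize` and `exact_match_groups` are assumptions, and matching on book name plus author (ignoring price) is a choice made here for the example.

```python
from collections import defaultdict

# Records from Table 1 as (book name, author, price, database).
RECORDS = [
    ("Professional Asp.net MVC 4", "Jon Galloway", 399.0, "Amazon.com"),
    ("Beginning ASP.NET 4.5 in C# and VB", "Imar Spaanjaars", 483.90, "Amazon.com"),
    ("ASP.NET 4 Unleashed", "Stephen", 680.0, "books.rediff.com"),
    ("Professional Asp.net MVC 4", "Jon Galloway", 520.0, "books.rediff.com"),
]


def normalize(text):
    """Lower-case and collapse whitespace so trivial differences do not block a match."""
    return " ".join(text.lower().split())


def exact_match_groups(records):
    """Group records whose normalized book name and author are identical."""
    groups = defaultdict(list)
    for name, author, price, database in records:
        groups[(normalize(name), normalize(author))].append((price, database))
    # Keep only the keys that occur in more than one record.
    return {key: hits for key, hits in groups.items() if len(hits) > 1}


duplicates = exact_match_groups(RECORDS)
for (name, author), hits in duplicates.items():
    print(name, "|", author, "|", hits)
```

Only the book listed by both stores is reported, with one (price, database) entry per store.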
3. Matching Records:
Highly similar records are considered to be the same entity, so identifying the same entities
across Web databases means identifying duplicate records. Compared with the work above, the
UDD algorithm is developed for the Web database scenario, where the entries to match are of a
single type with multiple fields. Our work focuses on assigning a weight to every entry rather
than on the similarity function. Moreover, our work is also loosely related to classification
problems with a single class of training data, where the aim is to find data similar to the
given data set. Here a completely identical entry is called Equality and is assigned a weight
of 1; a half-matching entry is called Stemming and is assigned a weight of 0.5; and a
completely unmatched entry is called Drop and is assigned a weight of 0.
3 Proposed System
New filtering methodologies are designed by exploiting ordering information; they are integrated
into the existing methods, drastically reduce the user sizes and hence improve the quality. A new
technique, Unsupervised Duplicate Detection (UDD), was then developed for the specific record matching
problem of identifying duplicates among records in query results from multiple Web databases. A new
exact similarity join algorithm is introduced with application to near-duplicate detection.
Procedures are used for adjusting the weights of the record fields when calculating the similarity
between two entries. Two entries are considered identical if they are "similar enough" on their
fields. Different fields may need to be assigned different importance weights in an adaptive and
dynamic manner. Finally, an efficient algorithm is developed, with several optimizations that
significantly reduce the overall computation time on real and synthetic data sets. The focus is
on Web databases from a single domain. Different users enter different queries in the search
box of a search engine. An initial solution to this problem is to learn a Weighted Component
Similarity Summing (WCSS) classifier.
3.1 Weighted Component Similarity Summing Classifier
The WCSS classifier is used to detect some duplicate vectors when no positive examples are
available. This work builds on the idea of UDD and attempts to enhance the algorithm by
introducing an additional classifier known as the "Blocking Classifier". In the Weighted Component
Similarity Summing (WCSS) classifier, the importance of the fields is determined and duplicates
are identified without any training. The idea of this classifier is to calculate the similarity between
a pair of records by a field-to-field comparison. The duplicate detection methods in UDD focus
on Web databases from the same domain. Each similarity function measures the equality of selected
attributes against the fields of the other record and assigns an equality value to each field.
Clustering techniques are then used to group the fields based on the similarity values. The resulting
matching and non-matching pairs are used for clustering and for eliminating the duplicates. A
rule-based duplicate detection and elimination approach is used for detecting and eliminating the
duplicate records.
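The field-to-field comparison and weighted summing described above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the field weights (0.5, 0.4, 0.1), the threshold 0.8, and all function names are hypothetical, and the per-field similarity is reduced to a case-insensitive exact match.

```python
def field_similarity(a, b):
    """Return 1.0 for a case-insensitive exact match, otherwise 0.0."""
    return 1.0 if str(a).strip().lower() == str(b).strip().lower() else 0.0


def similarity_vector(rec_a, rec_b, fields):
    """Compare two records field by field."""
    return [field_similarity(rec_a[f], rec_b[f]) for f in fields]


def wcss_score(vector, weights):
    """Weighted sum of the component similarities (weights assumed to sum to 1)."""
    return sum(w * v for w, v in zip(weights, vector))


def is_duplicate(rec_a, rec_b, fields, weights, threshold):
    """Declare a duplicate when the weighted similarity reaches the threshold."""
    return wcss_score(similarity_vector(rec_a, rec_b, fields), weights) >= threshold


FIELDS = ("name", "author", "price")
WEIGHTS = (0.5, 0.4, 0.1)  # hypothetical field importance
THRESHOLD = 0.8            # hypothetical decision threshold

amazon = {"name": "Professional Asp.net MVC 4", "author": "Jon Galloway", "price": 399}
rediff = {"name": "Professional Asp.net MVC 4", "author": "Jon Galloway", "price": 520}
other = {"name": "ASP.NET 4 Unleashed", "author": "Stephen", "price": 680}

score = wcss_score(similarity_vector(amazon, rediff, FIELDS), WEIGHTS)
print(score, is_duplicate(amazon, rediff, FIELDS, WEIGHTS, THRESHOLD))
print(is_duplicate(amazon, other, FIELDS, WEIGHTS, THRESHOLD))
```

Giving the price field a low weight lets the two listings of the same book be matched even though the stores quote different prices.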
3.1.1 Weighted Component Assignment
In this step the WCSS classifier assigns a weight to each component to indicate how well the
compared fields match. The weights are assigned as follows:
1. Equality: Each field is checked against every field in the view table. If the field matches
another field exactly, we assign a weight of 1, known as Equality.
2. Stemming: Similarly, if the field half matches another field, we assign a weight of 0.5,
known as Stemming.
3. Drop: If the field does not match the field in the view table at all, we assign a weight of 0,
known as Drop.
Suppose the view table contains a number of entries for Asp.net and C# books. Every entry is checked
against every other entry, and the weights are assigned as shown in the following figure.
Figure.3: Weight Assign in WCSS Classifier.
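The three weight levels can be sketched as a small function. The text does not define precisely what a "half" match is, so the rule used here (the two values share at least one word) is an assumption made only for illustration, as is the function name `assign_weight`.

```python
def assign_weight(field_a, field_b):
    """Return 1.0 (Equality), 0.5 (Stemming) or 0.0 (Drop) for two field values."""
    tokens_a = field_a.lower().split()
    tokens_b = field_b.lower().split()
    if tokens_a == tokens_b:
        return 1.0  # Equality: the fields match exactly
    if set(tokens_a) & set(tokens_b):
        return 0.5  # Stemming: partial match (assumed rule: a shared word)
    return 0.0      # Drop: the fields do not match at all


titles = [
    "Professional Asp.net MVC 4",
    "Beginning ASP.NET 4.5 in C# and VB",
    "ASP.NET 4 Unleashed",
    "Professional Asp.net MVC 4",
]

# Check every entry against every entry, as in the view table.
matrix = [[assign_weight(a, b) for b in titles] for a in titles]
for row in matrix:
    print(row)
```

A stricter partial-match rule, for example one based on word stems or edit distance, could be substituted without changing the three-level scheme.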
4 Unsupervised Duplicate Detection (UDD)
Duplicate records do not share a common key, and they contain errors that make duplicate
matching a difficult task. Errors are introduced as the result of recording errors, incomplete
information, lack of standard formats or any combination of these factors. We present a detailed
analysis of the literature on duplicate record detection. We cover similarity metrics that are
commonly used to detect similar field entries, and we present a general set of duplicate detection
algorithms that can detect duplicate records in a database. Here we use the technique of
Unsupervised Duplicate Detection.
4.1 Unsupervised Learning
As stated earlier, the comparison space consists of comparison vectors, which contain information
about the differences between fields in a pair of records. Unless some information exists about which
comparison vectors correspond to which category (match, non-match or possible-match), the labeling of
the comparison vectors in the training data set must be done manually. One way to avoid manual
labeling of the comparison vectors is to use clustering algorithms that group similar comparison
vectors together. The idea behind most unsupervised learning methods for duplicate detection is that
similar comparison vectors correspond to the same class.
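Grouping similar comparison vectors without labels can be sketched with a tiny two-cluster k-means. This is a generic stand-in chosen for illustration, not a method taken from the paper; the vectors hold per-field similarities (name, author, price) rather than differences, and the sample values and the name `two_means` are assumptions.

```python
def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def mean_vector(vectors):
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def two_means(vectors, iterations=20):
    """Split comparison vectors into a low-similarity (0) and a high-similarity (1) cluster."""
    # Deterministic start: the least and the most similar vectors seed the clusters.
    centroids = [min(vectors, key=sum), max(vectors, key=sum)]
    labels = []
    for _ in range(iterations):
        labels = [
            0 if squared_distance(v, centroids[0]) <= squared_distance(v, centroids[1]) else 1
            for v in vectors
        ]
        for k in (0, 1):
            members = [v for v, label in zip(vectors, labels) if label == k]
            if members:
                centroids[k] = mean_vector(members)
    return labels


# Hypothetical comparison vectors: similarity on (name, author, price).
vectors = [
    [1.0, 1.0, 0.0],
    [0.5, 0.0, 0.0],
    [0.5, 0.0, 0.0],
    [1.0, 1.0, 1.0],
    [0.0, 0.0, 0.0],
]
labels = two_means(vectors)
print(labels)
```

Vectors that agree on name and author end up in the same cluster without any manually labeled training pairs.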
4.2 Overview Of UDD Algorithm
UDD is an unsupervised way to detect duplicate fields in the data set. The key component of
the system is the component that contains the UDD algorithm. We look at developing an algorithm that
can train itself and assist in identifying duplicates. This algorithm includes a component that
analyzes the similarity vectors of a selection of the data set and allocates weights to the selected
vectors. An intuitive solution to this problem is to learn a classifier from N, the set of
non-duplicate vectors, and use the learned classifier to classify P, the set of potential duplicate
vectors. While there are some works based on learning from only positive (or negative) examples,
to our knowledge all works in the literature assume that the positive (or negative) examples are all
correct. However, N may contain a small set of false negative examples.
Figure 4: Duplicate vector detection process.
We propose a method that detects duplicate vectors in P iteratively. Unlike two related
mechanisms in which only one classifier is used during the iterations, we employ
two classifiers in each iteration that collaborate to detect duplicate vectors from P. Two classifiers
are used because the detected positive instances are not effective enough to retrain the classifier to get a
more accurate hypothesis, whereas two classifiers with different characteristics may discover different
sets of positive instances. Thus, the two classifiers can benefit from each other by taking advantage of
the duplicate vectors detected by the other classifier.
Figure 5: UDD Algorithm
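The cooperation of two classifiers over P and N can be sketched as the loop below. It is a simplified illustration, not the algorithm of Figure 5: the second classifier is a nearest-centroid stand-in, the weight-adjustment rule in `field_weights` is an assumption, false negatives in N are not handled, N is assumed to be non-empty, and the threshold and sample vectors are hypothetical.

```python
def wcss(vector, weights):
    return sum(w * v for w, v in zip(weights, vector))


def normalized(values):
    total = sum(values)
    if total == 0:
        return [1.0 / len(values)] * len(values)
    return [x / total for x in values]


def field_weights(duplicates, non_duplicates):
    """Assumed rule: weight fields that differ in N and agree in the duplicates found."""
    n_fields = len(non_duplicates[0])
    from_non = normalized([sum(1.0 - v[i] for v in non_duplicates) for i in range(n_fields)])
    if not duplicates:
        return from_non
    from_dup = normalized([sum(v[i] for v in duplicates) for i in range(n_fields)])
    return [(a + b) / 2 for a, b in zip(from_dup, from_non)]


def centroid(vectors):
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def split(vectors, predicate):
    hits = [v for v in vectors if predicate(v)]
    rest = [v for v in vectors if not predicate(v)]
    return hits, rest


def udd(potential, non_duplicates, threshold=0.8):
    """Move duplicate vectors out of `potential` until classifier 1 finds no more."""
    potential, duplicates = list(potential), []
    while True:
        # Classifier 1: weighted similarity sum, with weights adjusted each round.
        weights = field_weights(duplicates, non_duplicates)
        found, potential = split(potential, lambda v: wcss(v, weights) >= threshold)
        if not found:
            return duplicates, potential
        duplicates += found
        # Classifier 2 (stand-in): nearest centroid trained on the current sets.
        dup_c, non_c = centroid(duplicates), centroid(non_duplicates)
        extra, potential = split(
            potential,
            lambda v: squared_distance(v, dup_c) < squared_distance(v, non_c),
        )
        duplicates += extra


N = [[0.5, 0.0, 0.0], [0.0, 0.0, 1.0], [0.5, 0.0, 0.0], [0.0, 0.0, 0.0]]
P = [[1.0, 1.0, 1.0], [1.0, 1.0, 0.0], [0.5, 0.0, 0.0], [0.0, 0.0, 1.0]]
found_duplicates, remaining = udd(P, N)
print(found_duplicates, remaining)
```

In this run the first classifier accepts only the fully matching vector, and the second classifier then picks up the vector that agrees on name and author but not on price, which is the kind of mutual benefit the text describes.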
4.3 Evaluation Metric
The correctness of the detected duplicates can be checked using precision, recall and F-measure
to evaluate the performance of the algorithm, defined as follows:
1. Recall: the percentage of true duplicate entries that are identified as duplicates.
2. Precision: the percentage of entries identified as duplicates that are correct.
3. F-measure: the weighted harmonic mean of precision and recall.
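The three metrics can be computed as in the sketch below. It uses the balanced F-measure (equal weight on precision and recall), which is one common choice of the weighted harmonic mean; the pair identifiers and the function name `evaluate` are hypothetical.

```python
def evaluate(identified_pairs, true_pairs):
    """Return (precision, recall, f_measure) for a set of identified duplicate pairs."""
    identified, truth = set(identified_pairs), set(true_pairs)
    correct = len(identified & truth)
    precision = correct / len(identified) if identified else 0.0
    recall = correct / len(truth) if truth else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical record-id pairs: three identified, four true duplicates, two in common.
identified = [(1, 4), (2, 3), (5, 6)]
truth = [(1, 4), (5, 6), (7, 8), (9, 10)]
precision, recall, f_measure = evaluate(identified, truth)
print(precision, recall, f_measure)
```

Here two of the three identified pairs are correct (precision 2/3) and two of the four true pairs are found (recall 1/2), giving an F-measure of 4/7.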
5 Conclusion
Duplicate detection is an important step in data integration, and most state-of-the-art techniques
are based on offline learning techniques, which need training data. In the Web database scenario,
where the records to match are highly query-dependent, a pretrained method is not applicable because
the set of records in each query's results is a biased subset of the whole data set. To overcome this
problem, we presented an unsupervised, online method, UDD, for identifying duplicates over the
query results of multiple Web databases. Two classifiers, including WCSS, are used cooperatively in
the convergence step of record matching to detect the duplicate pairs from all potential duplicate
pairs iteratively. Experimental results show that our method is comparable to previous work that
requires training examples for detecting duplicates from the query results of multiple Web databases.
6 Acknowledgment
We wish to thank our HOD, Mr. A. D. Jadhav, and our guide, Mr. S. K. Waman, for inspiring us and
providing valuable guidance in writing this paper.

USING GOOGLE’S KEYWORD RELATION IN MULTIDOMAIN DOCUMENT CLASSIFICATION
 
JPJ1423 Keyword Query Routing
JPJ1423   Keyword Query RoutingJPJ1423   Keyword Query Routing
JPJ1423 Keyword Query Routing
 
Indexing based Genetic Programming Approach to Record Deduplication
Indexing based Genetic Programming Approach to Record DeduplicationIndexing based Genetic Programming Approach to Record Deduplication
Indexing based Genetic Programming Approach to Record Deduplication
 
Different Similarity Measures for Text Classification Using Knn
Different Similarity Measures for Text Classification Using KnnDifferent Similarity Measures for Text Classification Using Knn
Different Similarity Measures for Text Classification Using Knn
 
Hierarchal clustering and similarity measures along
Hierarchal clustering and similarity measures alongHierarchal clustering and similarity measures along
Hierarchal clustering and similarity measures along
 
Hierarchal clustering and similarity measures along with multi representation
Hierarchal clustering and similarity measures along with multi representationHierarchal clustering and similarity measures along with multi representation
Hierarchal clustering and similarity measures along with multi representation
 
Elimination of data redundancy before persisting into dbms using svm classifi...
Elimination of data redundancy before persisting into dbms using svm classifi...Elimination of data redundancy before persisting into dbms using svm classifi...
Elimination of data redundancy before persisting into dbms using svm classifi...
 
Approaches for Keyword Query Routing
Approaches for Keyword Query RoutingApproaches for Keyword Query Routing
Approaches for Keyword Query Routing
 
MAP/REDUCE DESIGN AND IMPLEMENTATION OF APRIORIALGORITHM FOR HANDLING VOLUMIN...
MAP/REDUCE DESIGN AND IMPLEMENTATION OF APRIORIALGORITHM FOR HANDLING VOLUMIN...MAP/REDUCE DESIGN AND IMPLEMENTATION OF APRIORIALGORITHM FOR HANDLING VOLUMIN...
MAP/REDUCE DESIGN AND IMPLEMENTATION OF APRIORIALGORITHM FOR HANDLING VOLUMIN...
 
PATTERN GENERATION FOR COMPLEX DATA USING HYBRID MINING
PATTERN GENERATION FOR COMPLEX DATA USING HYBRID MININGPATTERN GENERATION FOR COMPLEX DATA USING HYBRID MINING
PATTERN GENERATION FOR COMPLEX DATA USING HYBRID MINING
 
Cluster Based Web Search Using Support Vector Machine
Cluster Based Web Search Using Support Vector MachineCluster Based Web Search Using Support Vector Machine
Cluster Based Web Search Using Support Vector Machine
 
Iaetsd a survey on one class clustering
Iaetsd a survey on one class clusteringIaetsd a survey on one class clustering
Iaetsd a survey on one class clustering
 
Implementation of Matching Tree Technique for Online Record Linkage
Implementation of Matching Tree Technique for Online Record LinkageImplementation of Matching Tree Technique for Online Record Linkage
Implementation of Matching Tree Technique for Online Record Linkage
 
Dt35682686
Dt35682686Dt35682686
Dt35682686
 
Online index recommendations for high dimensional databases using query workl...
Online index recommendations for high dimensional databases using query workl...Online index recommendations for high dimensional databases using query workl...
Online index recommendations for high dimensional databases using query workl...
 
Annotation Approach for Document with Recommendation
Annotation Approach for Document with Recommendation Annotation Approach for Document with Recommendation
Annotation Approach for Document with Recommendation
 

Recently uploaded

6th International Conference on Machine Learning & Applications (CMLA 2024)
6th International Conference on Machine Learning & Applications (CMLA 2024)6th International Conference on Machine Learning & Applications (CMLA 2024)
6th International Conference on Machine Learning & Applications (CMLA 2024)
ClaraZara1
 
01-GPON Fundamental fttx ftth basic .pptx
01-GPON Fundamental fttx ftth basic .pptx01-GPON Fundamental fttx ftth basic .pptx
01-GPON Fundamental fttx ftth basic .pptx
benykoy2024
 
Unbalanced Three Phase Systems and circuits.pptx
Unbalanced Three Phase Systems and circuits.pptxUnbalanced Three Phase Systems and circuits.pptx
Unbalanced Three Phase Systems and circuits.pptx
ChristineTorrepenida1
 
Heap Sort (SS).ppt FOR ENGINEERING GRADUATES, BCA, MCA, MTECH, BSC STUDENTS
Heap Sort (SS).ppt FOR ENGINEERING GRADUATES, BCA, MCA, MTECH, BSC STUDENTSHeap Sort (SS).ppt FOR ENGINEERING GRADUATES, BCA, MCA, MTECH, BSC STUDENTS
Heap Sort (SS).ppt FOR ENGINEERING GRADUATES, BCA, MCA, MTECH, BSC STUDENTS
Soumen Santra
 
Harnessing WebAssembly for Real-time Stateless Streaming Pipelines
Harnessing WebAssembly for Real-time Stateless Streaming PipelinesHarnessing WebAssembly for Real-time Stateless Streaming Pipelines
Harnessing WebAssembly for Real-time Stateless Streaming Pipelines
Christina Lin
 
CHINA’S GEO-ECONOMIC OUTREACH IN CENTRAL ASIAN COUNTRIES AND FUTURE PROSPECT
CHINA’S GEO-ECONOMIC OUTREACH IN CENTRAL ASIAN COUNTRIES AND FUTURE PROSPECTCHINA’S GEO-ECONOMIC OUTREACH IN CENTRAL ASIAN COUNTRIES AND FUTURE PROSPECT
CHINA’S GEO-ECONOMIC OUTREACH IN CENTRAL ASIAN COUNTRIES AND FUTURE PROSPECT
jpsjournal1
 
Recycled Concrete Aggregate in Construction Part III
Recycled Concrete Aggregate in Construction Part IIIRecycled Concrete Aggregate in Construction Part III
Recycled Concrete Aggregate in Construction Part III
Aditya Rajan Patra
 
Low power architecture of logic gates using adiabatic techniques
Low power architecture of logic gates using adiabatic techniquesLow power architecture of logic gates using adiabatic techniques
Low power architecture of logic gates using adiabatic techniques
nooriasukmaningtyas
 
Technical Drawings introduction to drawing of prisms
Technical Drawings introduction to drawing of prismsTechnical Drawings introduction to drawing of prisms
Technical Drawings introduction to drawing of prisms
heavyhaig
 
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单专业办理
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单专业办理一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单专业办理
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单专业办理
zwunae
 
[JPP-1] - (JEE 3.0) - Kinematics 1D - 14th May..pdf
[JPP-1] - (JEE 3.0) - Kinematics 1D - 14th May..pdf[JPP-1] - (JEE 3.0) - Kinematics 1D - 14th May..pdf
[JPP-1] - (JEE 3.0) - Kinematics 1D - 14th May..pdf
awadeshbabu
 
Modelagem de um CSTR com reação endotermica.pdf
Modelagem de um CSTR com reação endotermica.pdfModelagem de um CSTR com reação endotermica.pdf
Modelagem de um CSTR com reação endotermica.pdf
camseq
 
PROJECT FORMAT FOR EVS AMITY UNIVERSITY GWALIOR.ppt
PROJECT FORMAT FOR EVS AMITY UNIVERSITY GWALIOR.pptPROJECT FORMAT FOR EVS AMITY UNIVERSITY GWALIOR.ppt
PROJECT FORMAT FOR EVS AMITY UNIVERSITY GWALIOR.ppt
bhadouriyakaku
 
在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样
在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样
在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样
obonagu
 
一比一原版(UofT毕业证)多伦多大学毕业证成绩单如何办理
一比一原版(UofT毕业证)多伦多大学毕业证成绩单如何办理一比一原版(UofT毕业证)多伦多大学毕业证成绩单如何办理
一比一原版(UofT毕业证)多伦多大学毕业证成绩单如何办理
ydteq
 
Literature Review Basics and Understanding Reference Management.pptx
Literature Review Basics and Understanding Reference Management.pptxLiterature Review Basics and Understanding Reference Management.pptx
Literature Review Basics and Understanding Reference Management.pptx
Dr Ramhari Poudyal
 
sieving analysis and results interpretation
sieving analysis and results interpretationsieving analysis and results interpretation
sieving analysis and results interpretation
ssuser36d3051
 
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&BDesign and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
Sreedhar Chowdam
 
basic-wireline-operations-course-mahmoud-f-radwan.pdf
basic-wireline-operations-course-mahmoud-f-radwan.pdfbasic-wireline-operations-course-mahmoud-f-radwan.pdf
basic-wireline-operations-course-mahmoud-f-radwan.pdf
NidhalKahouli2
 
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单专业办理
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单专业办理一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单专业办理
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单专业办理
zwunae
 

Recently uploaded (20)

6th International Conference on Machine Learning & Applications (CMLA 2024)
6th International Conference on Machine Learning & Applications (CMLA 2024)6th International Conference on Machine Learning & Applications (CMLA 2024)
6th International Conference on Machine Learning & Applications (CMLA 2024)
 
01-GPON Fundamental fttx ftth basic .pptx
01-GPON Fundamental fttx ftth basic .pptx01-GPON Fundamental fttx ftth basic .pptx
01-GPON Fundamental fttx ftth basic .pptx
 
Unbalanced Three Phase Systems and circuits.pptx
Unbalanced Three Phase Systems and circuits.pptxUnbalanced Three Phase Systems and circuits.pptx
Unbalanced Three Phase Systems and circuits.pptx
 
Heap Sort (SS).ppt FOR ENGINEERING GRADUATES, BCA, MCA, MTECH, BSC STUDENTS
Heap Sort (SS).ppt FOR ENGINEERING GRADUATES, BCA, MCA, MTECH, BSC STUDENTSHeap Sort (SS).ppt FOR ENGINEERING GRADUATES, BCA, MCA, MTECH, BSC STUDENTS
Heap Sort (SS).ppt FOR ENGINEERING GRADUATES, BCA, MCA, MTECH, BSC STUDENTS
 
Harnessing WebAssembly for Real-time Stateless Streaming Pipelines
Harnessing WebAssembly for Real-time Stateless Streaming PipelinesHarnessing WebAssembly for Real-time Stateless Streaming Pipelines
Harnessing WebAssembly for Real-time Stateless Streaming Pipelines
 
CHINA’S GEO-ECONOMIC OUTREACH IN CENTRAL ASIAN COUNTRIES AND FUTURE PROSPECT
CHINA’S GEO-ECONOMIC OUTREACH IN CENTRAL ASIAN COUNTRIES AND FUTURE PROSPECTCHINA’S GEO-ECONOMIC OUTREACH IN CENTRAL ASIAN COUNTRIES AND FUTURE PROSPECT
CHINA’S GEO-ECONOMIC OUTREACH IN CENTRAL ASIAN COUNTRIES AND FUTURE PROSPECT
 
Recycled Concrete Aggregate in Construction Part III
Recycled Concrete Aggregate in Construction Part IIIRecycled Concrete Aggregate in Construction Part III
Recycled Concrete Aggregate in Construction Part III
 
Low power architecture of logic gates using adiabatic techniques
Low power architecture of logic gates using adiabatic techniquesLow power architecture of logic gates using adiabatic techniques
Low power architecture of logic gates using adiabatic techniques
 
Technical Drawings introduction to drawing of prisms
Technical Drawings introduction to drawing of prismsTechnical Drawings introduction to drawing of prisms
Technical Drawings introduction to drawing of prisms
 
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单专业办理
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单专业办理一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单专业办理
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单专业办理
 
[JPP-1] - (JEE 3.0) - Kinematics 1D - 14th May..pdf
[JPP-1] - (JEE 3.0) - Kinematics 1D - 14th May..pdf[JPP-1] - (JEE 3.0) - Kinematics 1D - 14th May..pdf
[JPP-1] - (JEE 3.0) - Kinematics 1D - 14th May..pdf
 
Modelagem de um CSTR com reação endotermica.pdf
Modelagem de um CSTR com reação endotermica.pdfModelagem de um CSTR com reação endotermica.pdf
Modelagem de um CSTR com reação endotermica.pdf
 
PROJECT FORMAT FOR EVS AMITY UNIVERSITY GWALIOR.ppt
PROJECT FORMAT FOR EVS AMITY UNIVERSITY GWALIOR.pptPROJECT FORMAT FOR EVS AMITY UNIVERSITY GWALIOR.ppt
PROJECT FORMAT FOR EVS AMITY UNIVERSITY GWALIOR.ppt
 
在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样
在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样
在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样
 
一比一原版(UofT毕业证)多伦多大学毕业证成绩单如何办理
一比一原版(UofT毕业证)多伦多大学毕业证成绩单如何办理一比一原版(UofT毕业证)多伦多大学毕业证成绩单如何办理
一比一原版(UofT毕业证)多伦多大学毕业证成绩单如何办理
 
Literature Review Basics and Understanding Reference Management.pptx
Literature Review Basics and Understanding Reference Management.pptxLiterature Review Basics and Understanding Reference Management.pptx
Literature Review Basics and Understanding Reference Management.pptx
 
sieving analysis and results interpretation
sieving analysis and results interpretationsieving analysis and results interpretation
sieving analysis and results interpretation
 
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&BDesign and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
 
basic-wireline-operations-course-mahmoud-f-radwan.pdf
basic-wireline-operations-course-mahmoud-f-radwan.pdfbasic-wireline-operations-course-mahmoud-f-radwan.pdf
basic-wireline-operations-course-mahmoud-f-radwan.pdf
 
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单专业办理
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单专业办理一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单专业办理
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单专业办理
 

Udd for multiple web databases

Copyright © IJIFR 2014. Author’s Subject Area: Computer Science and Engg
Available online at: http://www.ijifr.com/searchjournal.aspx
www.ijifr.com ijifr.journal@gmail.com ISSN (Online): 2347-1697
INTERNATIONAL JOURNAL OF INFORMATIVE & FUTURISTIC RESEARCH
An Enlightening Online Open Access, Refereed & Indexed Journal of Multidisciplinary Research
Volume -1 Issue -9, May 2014
Paper ID: IJIFR/V1/E9/001

UDD for Multiple Web Database
Bhadakwan Shrikant, Jagadale Prashant, Bhanarkar Prasad, Sandeep Bhatia
Shardchandra Pawar College of Engineering, Pune University, Pune

Abstract
UDD is an unsupervised algorithm that detects duplicates in Web database data. UDD is effective for record matching. Some Web databases use supervised methods, which are complex to apply to large Web databases and are time consuming. UDD addresses this problem and is also suitable for dynamically generated Web data. When a user submits a query on the Web, data is gathered from multiple Web databases and filtered by UDD to detect duplicate results, so that a unique result set is shown. For this we use the WCSS classifier, which performs string comparisons to identify duplicates. This paper shows that UDD works well for multiple Web databases, where a supervised method is not applicable.
Keywords - Record Matching, WCSS classifier, Multiple Web Database, Duplicate Detection, DD Algorithm

1 Introduction
Nowadays, the Web database is a main part of a Web service and stores a huge amount of data. A Web database holds various categories of data such as text, images and documents. It is used to generate Web pages dynamically, and the end user has no control over such pages. When a user sends a query to a Web database, it returns data related to the query. However, the user may get repeated data from multiple Web databases (e.g., the same book from multiple Web databases), so the combined results contain a large amount of redundancy. The main task is to detect the duplicate entries by observing and exactly matching fields from the various data sources. In this paper we focus on removing the duplicate data returned by the databases. This can be done by comparing the results as they are returned, and comparing data from different sources is an important step. Figures 1 and 2 show the results returned from two different online bookstores, amazon.com and books.rediff.com, in response to the same query “asp.net” over the title. Result 3 in Figure 1 and result 1 in Figure 2 refer to the same book, since they have the same title although their prices differ. To overcome this problem, the Unsupervised Duplicate Detection (UDD) algorithm is used.
Bhadakwan Shrikant, Jagadale Prashant, Bhanarkar Prasad, Sandeep Bhatia: UDD for Multiple Web Database
www.ijifr.com Email: ijifr.journal@gmail.com © IJIFR 2014. This paper is available online at http://www.ijifr.com/searchjournal.aspx. PAPER ID: IJIFR/V1/E9/001, ISSN (Online): 2347-1697, INTERNATIONAL JOURNAL OF INFORMATIVE & FUTURISTIC RESEARCH, Volume -1 Issue -9, May 2014. Author’s Research Area: Computer Science and Engg, Page No.: 01-07

UDD is a record matching technique that detects duplicates among records from different Web databases. The primary aim of this paper is the detection of duplicates, such as two or more books having the same name, ISBN or author. Identification of duplicates from multiple Web databases is based on WCSS. In the WCSS classifier, a weight is calculated between records; two or more records having the same weight are considered duplicates, and a vector table is used to separate the duplicate data set from the non-duplicate data set. This can be used only for multiple Web databases of a single domain and provides limited results. UDD concentrates on methods for adjusting the weights of the record fields when computing the resemblance between two records. Two entries are regarded as duplicates when they are exactly similar on their fields. Figures 1 and 2 show the same book, Asp.net MVC 4, with the same author name.

The methods that deal with duplicate identification can be broadly classified into those that require training data and those that can function without predefined training data. In this paper we explore unsupervised methods and look for a method that can work with negligible supervision. The search engine example is used to highlight the need for an algorithm that can handle huge amounts of data and find a unique result set that is more relevant to the user query.

Figure 1: Query Result from Amazon.com
Figure 2: Query Result from books.rediff.com
By comparing the figures above, we can find the duplicate results. We avoid this type of duplication by using the Unsupervised Duplicate Detection algorithm.

2 Related Work
2.1 Record Matching
Record matching identifies duplicate records in response to a user query, that is, records that represent the same real-world entity.
1. Problem Definition: We separate the duplicate and non-duplicate results using the WCSS classifier, and the unsupervised duplicate detection algorithm produces the unique result set.
2. Element Detection: Using record matching, we compare the user query with all the records in the site 1 and site 2 Web databases, here Amazon.com and books.rediff.com respectively, as shown in Table 1.

Table 1. Exact matching by two Web databases of dataset
Book Name                            Author            Price    Database
Professional Asp.net MVC 4           Jon Galloway      399      Amazon.com
Beginning ASP.NET 4.5 in C# and VB   Imar Spaanjaars   483.90   Amazon.com
ASP.NET 4 Unleashed                  Stephen           680      books.rediff.com
Professional Asp.net MVC 4           Jon Galloway      520      books.rediff.com

Here we perform exact matching as shown in the table above. Unsupervised Duplicate Elimination (UDE) removes the problem of duplication.
3. Matching Records: Highly similar records are considered the same entity. Identifying the same entities across Web databases means identifying duplicate records.

Compared with the works above, the UDD algorithm is developed for the Web database scenario, where the records to match are of a single type with multiple fields.
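The exact matching over the rows of Table 1 can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `BookRecord` type, the `normalize` helper and the choice of title plus author as the matching key are assumptions made for the example, while the four records are taken from Table 1.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class BookRecord:
    title: str
    author: str
    price: float
    source: str


def normalize(text: str) -> str:
    """Lower-case and collapse whitespace so trivial formatting differences are ignored."""
    return " ".join(text.lower().split())


def match_key(record: BookRecord) -> Tuple[str, str]:
    # Assumption: title and author identify a book; price and source are ignored,
    # because the same book may be priced differently by each bookstore.
    return (normalize(record.title), normalize(record.author))


def find_exact_duplicates(records: List[BookRecord]) -> List[List[BookRecord]]:
    """Group records by key and return only the groups with more than one record."""
    groups: Dict[Tuple[str, str], List[BookRecord]] = {}
    for record in records:
        groups.setdefault(match_key(record), []).append(record)
    return [group for group in groups.values() if len(group) > 1]


# The four rows of Table 1.
TABLE_1 = [
    BookRecord("Professional Asp.net MVC 4", "Jon Galloway", 399.0, "Amazon.com"),
    BookRecord("Beginning ASP.NET 4.5 in C# and VB", "Imar Spaanjaars", 483.90, "Amazon.com"),
    BookRecord("ASP.NET 4 Unleashed", "Stephen", 680.0, "books.rediff.com"),
    BookRecord("Professional Asp.net MVC 4", "Jon Galloway", 520.0, "books.rediff.com"),
]

duplicates = find_exact_duplicates(TABLE_1)
for group in duplicates:
    print(group[0].title, "->", [r.source for r in group])
```

On this data the only group reported is "Professional Asp.net MVC 4", which appears in both bookstores at different prices. Exact matching of this kind misses records that differ by a typo or an abbreviation, which is the case the weighted comparison is meant to handle.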
Our work focuses on assigning a weight to every field rather than on the similarity function. Moreover, our work is loosely related to classification problems with a single class of training data, where the goal is to find data identical to the given data set. Here, a completely identical entry is called Equality and is assigned weight 1; a half-matching entry is called Stemming and is assigned weight 0.5; a completely unmatched entry is called Drop and is assigned weight 0.

3 Proposed System
New filtering methodologies are designed by exploiting ordering information; they are integrated into the existing methods, drastically reduce the user sizes and hence improve the quality. A new technique, Unsupervised Duplicate Detection (UDD), was then developed for the specific record matching problem of identifying duplicates among records in query results from multiple Web databases. A new exact similarity join algorithm is introduced with application to near-duplicate detection. Procedures are used for adjusting the weights of the record fields when calculating the similarity between two entries. Two entries are considered identical if they are “similar enough” on their fields. Different fields may need to be assigned different importance weights in an adaptive and dynamic manner. Finally, an efficient algorithm is developed, with several optimizations that significantly reduce the overall computation time on a real data set and a synthetic data set. The main focus is on Web databases from a single domain. Different users enter different queries in the search box of a search engine. An initial solution to this problem is to learn a Weighted Component Similarity Summing (WCSS) classifier.

3.1 Weighted Component Similarity Summing Classifier
The WCSS classifier is used to detect some duplicate vectors when no positive examples are available. This work builds on the idea of UDD and attempts to enhance the algorithm by introducing an additional classifier known as the “Blocking Classifier”. In the Weighted Component Similarity Summing (WCSS) classifier, the importance of the fields is determined and duplicates are identified without any training. The idea of this classifier is to calculate the similarity between a pair of records by a field-to-field comparison.
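The field-to-field comparison with the three weights described above (1 for Equality, 0.5 for Stemming, 0 for Drop) might be sketched as below. Several details are assumptions made only for illustration, since the paper does not specify them: a "half" match is taken to mean that two field values share at least one word, the per-field importance weights in `FIELD_WEIGHTS` and the cut-off `DUPLICATE_THRESHOLD` are hypothetical values, and the final score is taken to be the weighted sum of the per-field values.

```python
from typing import Dict, List

FIELDS = ("title", "author")
# Hypothetical importance weights per field (chosen to sum to 1).
FIELD_WEIGHTS: Dict[str, float] = {"title": 0.7, "author": 0.3}
# Hypothetical cut-off above which a pair is reported as duplicate.
DUPLICATE_THRESHOLD = 0.75


def tokens(value: str) -> set:
    """Case-insensitive set of whitespace-separated words."""
    return set(value.lower().split())


def field_weight(a: str, b: str) -> float:
    """Return 1.0 (Equality), 0.5 (Stemming) or 0.0 (Drop) for two field values."""
    ta, tb = tokens(a), tokens(b)
    if ta == tb:
        return 1.0  # Equality: same words in both values
    if ta & tb:
        return 0.5  # Stemming: assumed here to mean at least one shared word
    return 0.0      # Drop: nothing in common


def similarity_vector(r1: Dict[str, str], r2: Dict[str, str]) -> List[float]:
    """One weight per field, in the order given by FIELDS."""
    return [field_weight(r1[f], r2[f]) for f in FIELDS]


def wcss_similarity(r1: Dict[str, str], r2: Dict[str, str]) -> float:
    """Weighted sum of the per-field values."""
    vector = similarity_vector(r1, r2)
    return sum(FIELD_WEIGHTS[f] * v for f, v in zip(FIELDS, vector))


def is_duplicate(r1: Dict[str, str], r2: Dict[str, str]) -> bool:
    return wcss_similarity(r1, r2) >= DUPLICATE_THRESHOLD


amazon = {"title": "Professional Asp.net MVC 4", "author": "Jon Galloway"}
rediff = {"title": "Professional ASP.NET MVC 4", "author": "Jon Galloway"}
other = {"title": "ASP.NET 4 Unleashed", "author": "Stephen"}

print(similarity_vector(amazon, rediff), is_duplicate(amazon, rediff))
print(similarity_vector(amazon, other), is_duplicate(amazon, other))
```

With these assumed settings the first pair matches on both fields and is reported as a duplicate, while the second pair only shares some title words and falls below the threshold. In UDD the field weights are adjusted from the data rather than fixed by hand as they are here.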
Duplicate detection methods in UDD focus on Web databases from the same domain. Every function measures the equality of selected attributes with other record fields and assigns an equality value to each field. Clustering techniques are used to group the fields based on the similarity values. The resulting matching and non-matching pairs are used for clustering and to eliminate the duplicates. A rule-based duplicate detection and elimination approach is used for detecting and eliminating the records.

3.1.1 Weighted Component Assignment
The WCSS classifier assigns a weight to each component for the adjacent field. The weights are assigned as follows:
1. Equality: Each field is checked against every field in the view table. If the field matches another field exactly, we assign weight 1; this is known as Equality.
2. Stemming: Similarly, if the field half matches another field, we assign weight 0.5; this is known as Stemming.
3. Drop: If the field in the view table does not match at all, we assign weight 0; this is known as Drop.
Suppose the view table contains a number of entries for Asp.net and C# books; every entry is checked against every other entry, and the weights are assigned as shown in Figure 3.

Figure 3: Weight Assign in WCSS Classifier

4 Unsupervised Duplicate Detection (UDD)
Duplicate records do not share a common key, and they contain errors that make duplicate matching a difficult task. Errors are introduced as the result of record errors, incomplete information, lack of standard formats, or any combination of these factors. We present a detailed analysis of the literature on duplicate record detection. We cover similarity metrics that are commonly used to detect similar field entries, and we present a general set of duplicate detection algorithms that can detect duplicate records in a database. Here we use the technique of Unsupervised Duplicate Detection.

4.1 Unsupervised Learning
As stated earlier, the comparison space consists of comparison vectors, which contain information about the differences between fields in a pair of records. Unless some information exists about which comparison vectors correspond to which category (match, non-match, or possible match), the labeling of the comparison vectors in the training data set has to be done manually. One way to avoid manual labeling of the comparison vectors is to use clustering algorithms and group together similar comparison vectors. The idea behind most unsupervised learning methods for duplicate detection is that similar comparison vectors correspond to the same class.

4.2 Overview of UDD Algorithm
UDD is an unsupervised way to detect duplicate records in the data set. The key component of the system is the component that implements the UDD algorithm.
We look at developing an algorithm that can train itself and assist in identifying duplicates. This algorithm includes a component that analyzes the similarity vectors of a selection of the data set and allocates weights to the selected vectors. Here N denotes the set of non-duplicate vectors and P the set of potential duplicate vectors. A naive solution to this problem is to learn a classifier from N and use the learned classifier to classify P. While there are some works based on learning from only positive (or negative) examples, to our knowledge all previous works assume that the positive (or negative) examples are all correct. N, however, may contain a small set of incorrect negative examples.

Figure 4: Duplicate vector detection process

We propose a method that detects duplicate vectors in P iteratively, in a way similar to earlier work. Unlike those mechanisms, in which only one classifier is used during the iteration, we employ two classifiers in each iteration that cooperate to detect duplicate vectors from P. Two classifiers are used because the detected positive instances are not effective enough to retrain a single classifier into a more accurate hypothesis, whereas two classifiers with different characteristics may discover different sets of positive instances. Thus, the two classifiers can help each other by taking advantage of the duplicate vectors detected by the other classifier.

Figure 5: UDD Algorithm

4.3 Evaluation Metric
The correctness of the detected duplicates can be checked using precision, recall and F-measure to evaluate the performance of the algorithm, defined as follows:
1. Recall: the percentage of actual duplicates that are detected.
2. Precision: the percentage of entries detected as duplicates that are correct.
3. F-measure: the weighted harmonic mean of precision and recall.
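The three metrics above can be computed from the set of detected duplicate pairs and the set of true duplicate pairs, as in the following sketch. The function names, the record identifiers and the `beta` parameter are illustrative assumptions; with `beta = 1` precision and recall are weighted equally, which is the common balanced form of the F-measure and is only one possible weighting.

```python
from typing import Iterable, Set, Tuple

Pair = Tuple[int, int]


def pair(a: int, b: int) -> Pair:
    """Store a pair of record ids in sorted order so (a, b) and (b, a) are the same pair."""
    return (a, b) if a <= b else (b, a)


def evaluate(detected: Iterable[Pair], actual: Iterable[Pair], beta: float = 1.0):
    """Return (precision, recall, f_measure) for detected versus true duplicate pairs."""
    detected_set: Set[Pair] = {pair(*p) for p in detected}
    actual_set: Set[Pair] = {pair(*p) for p in actual}
    true_positives = len(detected_set & actual_set)

    precision = true_positives / len(detected_set) if detected_set else 0.0
    recall = true_positives / len(actual_set) if actual_set else 0.0

    # Weighted harmonic mean of precision and recall (F-beta).
    denominator = beta ** 2 * precision + recall
    f_measure = (1 + beta ** 2) * precision * recall / denominator if denominator else 0.0
    return precision, recall, f_measure


# Hypothetical record ids: two pairs were detected, two pairs are truly duplicates,
# and only one detected pair is correct.
precision, recall, f_measure = evaluate(
    detected=[(1, 4), (2, 3)],
    actual=[(4, 1), (5, 6)],
)
print(precision, recall, f_measure)
```

In this toy case one of the two detected pairs is correct and one of the two true pairs is found, so precision, recall and F-measure are all 0.5.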
5 Conclusion
Duplicate detection is an important step in data integration, and most state-of-the-art techniques are based on offline learning techniques, which need training data. In the Web database scenario, where the records to match are highly query-dependent, a pretrained method is not applicable, as the set of records in each query’s results is a partial subset of the whole data set. To overcome this problem, we presented an unsupervised, online method, UDD, for identifying duplicates over the query results of multiple Web databases. Two classifiers, including WCSS, are used cooperatively in the merging step of record matching to detect the duplicate pairs from all potential duplicate pairs iteratively. Experimental results show that our method is comparable to previous work that requires training examples for detecting duplicates from the query results of multiple Web databases.

6 Acknowledgment
We wish to thank our HOD, Mr. A.D. Jadhav, and our guide, Mr. S.K. Waman, for their inspiration and valuable guidance in writing this paper.