International Journal of Computer Science & Information Technology (IJCSIT) Vol 5, No 5, October 2013
DOI: 10.5121/ijcsit.2013.5514

Data Mining Model for the Data Retrieval from
Central Server Configuration
Srivatsan Sridharan¹, Kausal Malladi¹ and Yamini Muralitharan²

¹ Department of Computer Science, IIIT - Bangalore, India.
² Department of Software Engineering, IIIT - Bangalore, India.

ABSTRACT
A server that must keep track of heavy document traffic is unable to filter the documents that are most relevant and most recently updated for continuous text search queries. This paper focuses on handling continuous text extraction under high document traffic. The main objective is to retrieve recently updated documents that are most relevant to the query by applying a sliding window technique. Our solution indexes the streamed documents in main memory with a structure based on the principles of the inverted file, and processes document arrival and expiration events with an incremental threshold-based method. It also eliminates duplicate document retrieval using unsupervised duplicate detection. The documents are ranked based on user feedback, and higher-ranked documents are given higher priority for retrieval.

KEYWORDS
Continuous Text Queries, MapReduce Technique, Sliding Window.

1. INTRODUCTION
Data-intensive applications such as electronic mail, news feeds, telecommunication management, automation of business reporting, etc., raise the need for a continuous text search and monitoring model. In this model the documents arrive at the monitoring server in the form of a stream. Each query Q continuously retrieves, from a sliding window of the most recent documents, the k documents that are most similar to a fixed set of search terms.

Sliding window. This window reflects the interest of the users in the newest available documents. It can be defined in two alternative ways: a) a count-based window contains the N most recent documents for some constant number N; b) a time-based window contains only the documents that arrived within the last N time units (a short sketch of both window types appears after these definitions). Thus a document that is relevant to a query may still be ignored because it does not satisfy the time or count constraints of the user.

Incremental threshold method. The quintessence of the algorithm is to employ threshold-based techniques to derive the initial result for a query, and then to continually update the threshold to reflect document arrivals and expirations. At its core lies a memory-based index similar to the conventional inverted file, complemented with fast update techniques.

MapReduce technique. MapReduce is a powerful platform for large-scale data processing. The technique involves two steps: a) map step: the master node takes the input, partitions it into smaller sub-problems, and distributes them to worker nodes. A worker node may do this again in turn, leading to a multi-level structure. Each worker node processes its smaller problem and passes the answer back to its master node. b) reduce step: the master node collects the answers to all the sub-problems and combines them to form the output, the answer to the problem it was originally trying to solve.

Unsupervised duplicate detection. The problem of identifying objects in databases that refer to the same real-world entity is known, among others, as duplicate detection or record linkage [3]. Here this method is used to identify documents that are alike and prevent them from appearing in the result set. Our paper also focuses on ranking the documents based on user feedback. The user is allowed to give feedback for each document that has been retrieved. This feedback is used to rank the document and hence increase the probability of the document appearing in the sliding window.
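As referenced above, the following is a minimal sketch of the two window types; the class names and the deque-based bookkeeping are our own illustration, not code from the paper.

from collections import deque

class CountBasedWindow:
    """Keeps only the N most recent documents (an illustrative sketch)."""
    def __init__(self, n):
        self.n = n
        self.docs = deque()          # oldest at the left, newest at the right

    def add(self, doc):
        self.docs.append(doc)
        expired = []
        while len(self.docs) > self.n:
            expired.append(self.docs.popleft())   # these documents leave the window
        return expired

class TimeBasedWindow:
    """Keeps only documents that arrived within the last N time units."""
    def __init__(self, n_units):
        self.n_units = n_units
        self.docs = deque()          # (arrival_time, doc) pairs in arrival order

    def add(self, doc, arrival_time):
        self.docs.append((arrival_time, doc))
        expired = []
        while self.docs and self.docs[0][0] <= arrival_time - self.n_units:
            expired.append(self.docs.popleft()[1])
        return expired

# Usage: a query would be (re)evaluated against window.docs after every add().
window = CountBasedWindow(n=3)
for d in ["d1", "d2", "d3", "d4"]:
    print(d, "expired:", window.add(d))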
Visual Web Spider is a fully automated, multi-threaded web crawler that allows us to index and collect specific web pages on the Internet. Once installed, it enables us to browse the Web in an automated manner, indexing pages that contain specific keywords and phrases and exporting the indexed data to a database on our local computer in the format of our choice. We want to collect website links to build our own specialized web directory, and we can configure Visual Web Spider to do this automatically. The program's friendly, wizard-driven interface lets us customize our search in a step-by-step manner. To index relevant web pages, we follow a simple sequence of steps. After opening the wizard, enter the starting web page URL or let the program generate URL links based on specific keywords or phrases. Then set the crawling rules and depth according to your search strategy. Finally, specify the data you want to index and your project filename. Clicking on 'Start' sets the crawler to work. Crawling is fast, thanks to multi-threading that allows up to 50 simultaneous threads. Another useful feature is that Visual Web Spider can index the content of any HTML tag, such as: page title (TITLE tag), page text (BODY tag), HTML code (HTML tag), header text (H1-H6 tags), bold text (B tags), anchor text (A tags), alt text (IMG tag, ALT attribute), and keywords and description (META tags), among others. The program can also list each page's size and last-modified date. Once the web pages have been indexed, Visual Web Spider can export the indexed data to any of the following formats: Microsoft Access, Excel (CSV), TXT, HTML, and MySQL script.

1.1. Key Features
A personal, customizable web crawler; crawling rules; multi-threaded technology (up to 50 threads); support for the robots exclusion protocol/standard (robots.txt file and Robots META tags); indexing of the contents of any HTML tag; indexing rules; export of the indexed data into a Microsoft Access database, TEXT file, Excel file (CSV), HTML file, or MySQL script file; crawling from a list of URLs specified by the user; crawling using keywords and phrases; storage of web pages and media files on the local disk; auto-resolution of redirected links; detection of broken links; filtering of the indexed data. A minimal sketch of these crawling rules (start URL, depth limit, keyword filter, broken-link handling) follows below.
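The sketch below is not Visual Web Spider's API; it is a small, single-threaded illustration of the crawling rules just listed, using only the Python standard library. The URL and keywords in the example are placeholders.

import re
import urllib.request
from collections import deque
from urllib.parse import urljoin

def crawl(start_url, keywords, max_depth=2, max_pages=20):
    """Breadth-first crawl that indexes pages containing any of the keywords."""
    seen, indexed = {start_url}, []
    queue = deque([(start_url, 0)])
    while queue and len(indexed) < max_pages:
        url, depth = queue.popleft()
        try:
            html = urllib.request.urlopen(url, timeout=5).read().decode("utf-8", "ignore")
        except Exception:
            continue                                   # broken link: skip it
        if any(k.lower() in html.lower() for k in keywords):
            title = re.search(r"<title>(.*?)</title>", html, re.S | re.I)
            indexed.append((url, title.group(1).strip() if title else ""))
        if depth < max_depth:                          # crawling-depth rule
            for link in re.findall(r'href="([^"#]+)"', html):
                absolute = urljoin(url, link)
                if absolute.startswith("http") and absolute not in seen:
                    seen.add(absolute)
                    queue.append((absolute, depth + 1))
    return indexed

# Example: index pages reachable from a start URL that mention the given phrases.
print(crawl("https://example.com", ["example", "domain"], max_depth=1))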

2. EXISTING SYSTEM
Drawbacks of the existing servers that handle heavy document traffic are: they cannot efficiently monitor a data stream with highly dynamic document traffic; the server alone does all the processing, so a large amount of time is consumed; for continuous text search queries and extraction, the entire document set has to be scanned every time in order to find the relevant documents; there is no guarantee that duplicate documents are not retrieved for a given query; and a large number of documents cannot be stored in main memory, as this involves a large CPU cost.
Naïve solution: the most straightforward approach to evaluate the continuous queries defined above is to scan the entire window contents D after every update, or at fixed time intervals, compute all the document scores, and report the top-k documents. This method incurs high processing costs due to the need for frequent recomputations from scratch.
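A minimal sketch of this naïve strategy, assuming bag-of-words documents and a simple weighted term-frequency score (both assumptions are ours, not prescribed by the paper): every update rescans the whole window D and recomputes the top-k from scratch.

import heapq
from collections import Counter

def score(doc_terms, query_weights):
    # S(d/Q): accumulate w_dt * w_Qt over the terms shared by document and query.
    return sum(freq * query_weights.get(t, 0.0) for t, freq in doc_terms.items())

def naive_top_k(window_docs, query_weights, k):
    # Rescan the entire window D and recompute every score from scratch.
    scored = ((score(Counter(text.split()), query_weights), doc_id)
              for doc_id, text in window_docs.items())
    return heapq.nlargest(k, scored)

window_docs = {1: "red rose garden", 2: "blue sky", 3: "red red rose"}
print(naive_top_k(window_docs, {"red": 1.0, "rose": 0.5}, k=2))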


3. PROPOSED SYSTEM
3.1. Problem Formulation
In our model, a stream of documents flows into a central server. The user registers text queries at
the server, which is then responsible for continuously monitoring/reporting their results. As in
most stream processing systems, we store all the data in main memory in order to cope with
frequent updates, and design our methods with the primary goal of minimizing the CPU cost.
Moreover, it is necessary to reduce the workload of the monitoring server.

3.2. Proposed Solution
In our solution we use the MapReduce technique to reduce the workload of the central server: the server acts as the master node, which splits the processing task across several worker nodes. The number of worker nodes assigned to the processing task depends on the nature of the query posed by the user. The master node, upon receiving a query from the user, assigns workers to find the relevant partial result sets and return them to the master node. The master node, after receiving the partial solutions from the workers, integrates the results to produce the final result set for the given query. This is shown schematically in Fig. 1, and a sketch of the master/worker interaction is given after the figures below. Each worker/slave node uses the incremental threshold algorithm to compute the result set of the k most relevant and recent documents for the given query. The overall system architecture is shown in Fig. 2.

Figure 2. System architecture for the proposed data retrieval model.


Figure 1. A data retrieval system using MapReduce.
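The master/worker interaction referenced above can be sketched as follows; the partitioning of documents across workers and the merge of partial top-k lists are our own simplification of the scheme in the figures, and the scoring is the same illustrative weighted term-frequency used earlier.

import heapq

def worker_top_k(doc_partition, query_weights, k):
    # Map step: each worker evaluates the query on its own partition of the stream.
    scored = []
    for doc_id, terms in doc_partition.items():
        s = sum(freq * query_weights.get(t, 0.0) for t, freq in terms.items())
        scored.append((s, doc_id))
    return heapq.nlargest(k, scored)

def master_top_k(partitions, query_weights, k):
    # Reduce step: the master merges the partial top-k lists into the final result.
    partial = []
    for part in partitions:
        partial.extend(worker_top_k(part, query_weights, k))
    return heapq.nlargest(k, partial)

partitions = [
    {1: {"red": 2, "rose": 1}, 2: {"blue": 1}},
    {3: {"red": 1, "rose": 3}, 4: {"sky": 2}},
]
print(master_top_k(partitions, {"red": 1.0, "rose": 0.5}, k=2))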

Each element of the input stream comprises a document d, a unique document identifier, the document arrival time, and a composition list. The composition list contains one (t, wdt) pair for each term t belonging to T that appears in the document, where wdt is the frequency of term t in document d. The notations used in this model are listed in Fig. 3.

Figure 3. A detailed list of the notations.

The worker node maintains an inverted index with a list Lt for each term t appearing in its documents. With the inverted index, a query Q is processed as follows: the inverted lists of the terms t belonging to Q are scanned, and the partial wdt scores of each encountered document d are accumulated to produce S(d/Q). The documents with the highest scores at the end are returned as the result.
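A sketch of this accumulation over the inverted lists, with an illustrative index and query weights (the data and helper names are assumptions, not taken from the paper):

from collections import defaultdict

# Inverted index: term -> list of (doc_id, w_dt) impact entries.
inverted = {
    "red":  [(6, 3), (7, 1)],
    "rose": [(7, 2), (9, 1)],
}
query_weights = {"red": 1.0, "rose": 0.5}   # w_Qt for each query term

def evaluate(query_weights, inverted):
    scores = defaultdict(float)
    for t, w_qt in query_weights.items():
        for doc_id, w_dt in inverted.get(t, []):   # scan only the lists of query terms
            scores[doc_id] += w_qt * w_dt          # accumulate partial scores into S(d/Q)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

print(evaluate(query_weights, inverted))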

3.3. Incremental Threshold Algorithm
Fig. 4 shows the data structures used in this system. The valid documents D are stored in a single list, shown at the bottom of the figure. Each element of the list holds the information of a document (identifier, text content, composition list, arrival time). D contains the most recent documents for both count-based and time-based windows. Since documents expire in a first-in-first-out manner, D is maintained efficiently by inserting arriving documents at the end of the list and deleting expiring ones from its head. On top of the list of valid documents we build an inverted index. The structure at the top of the figure is the dictionary of search terms. It is an array that contains an entry for each term t belonging to T. The dictionary entry for t stores a pointer to the corresponding inverted list Lt. Lt holds an impact entry for each document d that contains t, together with a pointer to d's full information in the document list. When a document d arrives, an impact entry (d, wdt) (derived from d's composition list) is inserted into the inverted list of each term t that appears in d. Likewise, the impact entries of an expiring document are removed from the respective inverted lists. The inverted lists are kept sorted on wdt while supporting fast (logarithmic) insertions and deletions.
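A sketch of the arrival/expiration bookkeeping described above. The paper keeps each inverted list in a structure that supports logarithmic updates; here we approximate it with bisect-maintained sorted Python lists, which is a simplification rather than the authors' data structure.

import bisect
from collections import deque, defaultdict

documents = deque()                  # D: FIFO list of (doc_id, composition_list)
inverted = defaultdict(list)         # Lt: impact entries kept sorted on w_dt

def on_arrival(doc_id, composition):         # composition: {term: w_dt}
    documents.append((doc_id, composition))  # newest documents go to the tail of D
    for t, w_dt in composition.items():
        bisect.insort(inverted[t], (w_dt, doc_id))   # keep Lt sorted on w_dt

def on_expiration():
    doc_id, composition = documents.popleft()        # oldest document leaves from the head
    for t, w_dt in composition.items():
        inverted[t].remove((w_dt, doc_id))           # drop its impact entry from every Lt
    return doc_id

on_arrival(6, {"red": 3, "rose": 1})
on_arrival(7, {"red": 1, "rose": 2})
print(inverted["red"])     # [(1, 7), (3, 6)]
print(on_expiration())     # 6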
Initial Top-k Search: When a query is first submitted to the system, its top-k result is computed using the initial search module. The process is an adaptation of the threshold algorithm. Here, the inverted lists Lt of the query terms play the role of the sorted attribute lists. Unlike the original threshold algorithm, however, we do not probe the lists in a round-robin fashion. Since the similarity function associates a different weight wQt with each query term, inspired by [4] we probe the list Lt with the highest value ct = wQt . wdnxt,t, where dnxt is the next document in Lt. The global threshold gt, a notion used identically to the original algorithm, is the sum of the ct values over all the terms in Q. Consider a query Q1 with search string "red rose" and k = 2. Let term t20 = "red" and t11 = "rose". First the server identifies the inverted lists L11 and L20 (using the dictionary hash table) and computes the values c11 = wQ1,t11 . wd7,t11 and c20 = wQ1,t20 . wd6,t20. In iteration 1, since c20 is larger, the first entry of L20 is popped; the similarity score of the corresponding document, d6, is computed by accessing its composition list in D, and d6 is inserted into the tentative result R. c20 is then updated to the next impact entry of L20. A document whose score is above the local threshold is still included in R, but as an unverified entry. The algorithm is as follows; a simplified sketch of the initial top-k search is given after the listing.
Algorithm: Incremental Threshold with Duplicate Detection (arriving dins, expiring ddel)
 1:  Insert document dins into D (the system document list)
 2:  for all terms t in the composition list of dins do
 3:    for all documents in Lt
 4:      for all terms t in dins
 5:        Compute unique(dins)
 6:        wdins,t != wdnxt,t
 7:    Insert the impact entry of dins into Lt
 8:    Probe the threshold tree of Lt
 9:    for all queries Q where wdins,t >= localThreshold do
10:      if Q has not been considered for dins in another Lt then
11:        Compute S(dins/Q)
12:        Insert dins into R
13:        if S(dins/Q) >= old Sk then
14:          Update Sk (since dins enters the top-k result)
15:          Keep rolling up local thresholds while τ <= Sk
16:        Set new τ as influence threshold for Q
17:        Update local thresholds of Q
18:  Delete document ddel from D (the system document list)
19:  for all terms t in the composition list of ddel do
20:    Delete the impact entry of ddel from Lt
21:    Probe the threshold tree of Lt
22:    for all queries Q where wddel,t >= localThreshold do
23:      if Q has not been considered for ddel in another Lt then
24:        Delete ddel from R
25:        if S(ddel/Q) >= old Sk then
26:          Resume top-k search from local thresholds
27:          Set new τ as influence threshold for Q
28:          Update local thresholds of Q
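The following is a simplified sketch of the initial top-k search in the spirit of the threshold algorithm described above: the list whose next entry has the highest ct = wQt . wdt is probed first, and the search stops once the k-th best score reaches the global threshold. The data, weighting, and stopping details are illustrative assumptions, not the authors' exact procedure (local and influence thresholds are omitted).

import heapq

def initial_top_k(inverted, query_weights, compositions, k):
    # inverted[t] is sorted by w_dt descending; compositions[doc_id] is {term: w_dt}.
    cursors = {t: 0 for t in query_weights if t in inverted}
    scores, seen = {}, set()
    while cursors:
        # Probe the list whose next entry has the highest c_t = w_Qt * w_dt.
        t = max(cursors, key=lambda u: query_weights[u] * inverted[u][cursors[u]][1])
        doc_id, _ = inverted[t][cursors[t]]
        cursors[t] += 1
        if cursors[t] == len(inverted[t]):
            del cursors[t]
        if doc_id not in seen:
            seen.add(doc_id)
            comp = compositions[doc_id]
            scores[doc_id] = sum(w_qt * comp.get(q, 0) for q, w_qt in query_weights.items())
        top = heapq.nlargest(k, scores.values())
        # Global threshold gt: sum of c_t values over the remaining list heads.
        gt = sum(query_weights[u] * inverted[u][cursors[u]][1] for u in cursors)
        if len(top) >= k and top[-1] >= gt:
            break   # no unseen document can beat the current k-th score
    return heapq.nlargest(k, scores.items(), key=lambda kv: kv[1])

inverted = {"red": [(6, 3), (7, 1)], "rose": [(7, 2), (9, 1)]}
compositions = {6: {"red": 3}, 7: {"red": 1, "rose": 2}, 9: {"rose": 1}}
print(initial_top_k(inverted, {"red": 1.0, "rose": 0.5}, compositions, k=2))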
After constructing the initial result set R using the above algorithm, only the documents that have a score higher than or equal to the influence threshold τ (tau) are verified. The key point is that no duplicate documents form part of the result set R. This is ensured using unsupervised duplicate detection. The idea of unsupervised learning for duplicate detection has its roots in the probabilistic model proposed by Fellegi and Sunter. When there is no training data from which to compute the probability estimates, it is possible to use variations of the Expectation Maximization algorithm to identify appropriate clusters in the data.
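The paper adopts the Fellegi-Sunter / Expectation Maximization formulation; as a much simpler stand-in, the sketch below flags near-duplicate documents by thresholding the cosine similarity of their term-frequency vectors. This is our own simplification, not the authors' method.

import math
from collections import Counter

def cosine(a, b):
    # Cosine similarity between two bag-of-words term-frequency vectors.
    dot = sum(a[t] * b.get(t, 0) for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def deduplicate(result_docs, threshold=0.9):
    # Keep a document only if it is not near-identical to one already kept.
    kept = []
    for doc_id, text in result_docs:
        vec = Counter(text.lower().split())
        if all(cosine(vec, kept_vec) < threshold for _, kept_vec in kept):
            kept.append((doc_id, vec))
    return [doc_id for doc_id, _ in kept]

results = [(1, "red rose in the garden"), (2, "Red rose in the garden"), (3, "blue sky")]
print(deduplicate(results))   # [1, 3]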

Figure 4. Data structures used for the incremental threshold algorithm.

4. PERFORMANCE AND IMPLEMENTATION ISSUES
In Figure 5 we empirically compare the Incremental Threshold Algorithm (ITA) against Naïve. We enhance Naïve with a technique that retrieves the top-kmax documents (for an analytically derived kmax that is larger than k) whenever the result is computed from scratch, in order to reduce the frequency of subsequent recomputations. We form a document stream comprising 172,961 articles. After standard stopword removal, the dictionary contains 181,978 terms. The documents are streamed into the monitoring system following a Poisson process with a mean arrival rate of 200 documents/second. We generate 1,000 queries with k = 10 and terms selected randomly from the dictionary. We use a count-based window; the results for a time-based one are similar. First, we investigate the effect of the number of search terms n on the performance of ITA and Naïve. In Figure 5(a), we set the window size to 1,000 documents and vary n from 4 to 40. We measure the average processing time, i.e., the elapsed time between the arrival of a new document (which additionally causes the expiration of an existing one) and the point where all the query results are updated accordingly. The measurements are in milliseconds and are plotted in logarithmic scale. With more search terms (i.e., larger n), an arriving/expiring document has a higher chance of sharing common terms with the queries. This leads to an increase in the number of queries that need updating, and thus to a longer processing time. ITA is about 10 times faster than Naïve for queries comprising 4 search terms, and 6 times faster for 40-term queries.
Query length    Processing time for Naïve (msec)    Processing time for ITA (msec)
3               0.9                                 0.07
8               1.5                                 0.2
13              3                                   0.4
18              8                                   0.7
23              13                                  1.0

Table 1. Data for query length and processing time.

Figure 5a. Comparison of query length and processing time.
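The document stream used in the experiment can be reproduced in spirit with a Poisson arrival process; the arrival rate and dictionary size below mirror the setup described above, while the document contents and helper names are synthetic placeholders.

import random

def poisson_stream(rate_per_sec, dictionary, duration_sec, terms_per_doc=20, seed=0):
    """Yield (arrival_time, doc_terms) with exponential inter-arrival times (mean 1/rate)."""
    rng = random.Random(seed)
    t = 0.0
    while t < duration_sec:
        t += rng.expovariate(rate_per_sec)          # Poisson process => exponential gaps
        doc_terms = rng.sample(dictionary, terms_per_doc)
        yield t, doc_terms

dictionary = [f"term{i}" for i in range(181978)]    # dictionary size from the experiment
stream = poisson_stream(rate_per_sec=200, dictionary=dictionary, duration_sec=0.05)
for arrival_time, doc in stream:
    print(round(arrival_time, 4), doc[:3])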

In Figure 5(b), we study the effect of the sliding window size N. We set the query length to 10 terms and vary N from 10 to 100,000 documents. A larger window holds more valid documents in the system. For Naïve this imposes a higher cost whenever the result needs to be recomputed, because it scans the entire D. For ITA, the inverted lists grow longer, leading to higher index update cost and slower arrival/expiration handling. ITA is 13 times faster than Naïve for a window size of 10, and 18 times faster when the sliding window comprises 10,000 documents. Note that the last measurement for Naïve is missing, because for N > 10,000 the CPU utilization approaches 100% and the system becomes unstable. The above experiments, as well as others omitted due to lack of space, verify the general superiority of ITA over Naïve.


Window size    Processing time for Naïve (msec)    Processing time for ITA (msec)
5              10                                  5
13             20                                  7
25             30                                  13
30             40                                  15

Table 2. Data for window size and processing time.

Figure 5b. Comparison of window size and processing time.

5. CONCLUSION
In this paper we study the processing of continuous text queries over document streams. These queries define a set of search terms and request the continual monitoring of a ranked list of recent documents that are most similar to those terms. The problem arises in a variety of text monitoring applications, e.g., email and news tracking. To the best of our knowledge, this is the first attempt to address this important problem. Currently, our study focuses on plain text documents. A challenging direction for future work is to extend our methodology to documents tagged with metadata and documents with a hyperlink structure, as well as to specialized scoring mechanisms that may apply in these settings.

ACKNOWLEDGEMENTS
The authors would like to thank Prof. R. Parvatham, Assistant Professor, Sri SaiRam Engineering College, for her continuous support in implementing this work.

REFERENCES
[1] B. Babcock, S. Babu, M. Datar, R. Motwani, and J. Widom, 2002, "Models and Issues in Data Stream Systems", PODS '02, pp. 1-16.
[2] J. Zobel and A. Moffat, July 2006, "Inverted Files for Text Search Engines", ACM Computing Surveys, vol. 38, no. 2, pp. 1-55.
[3] V. N. Anh and A. Moffat, 2002, "Impact Transformation: Effective and Efficient Web Retrieval", Int'l ACM SIGIR Conf. on Research and Development in Information Retrieval.
[4] V. N. Anh, O. de Kretser, and A. Moffat, 2001, "Vector-Space Ranking with Effective Early Termination", ACM SIGIR Conf. on Research and Development in Information Retrieval.
[5] Y. Zhang and J. Callan, 2001, "Maximum Likelihood Estimation for Filtering Thresholds", ACM SIGIR Conf. on Research and Development in Information Retrieval.
[6] M. Persin, J. Zobel, and R. Sacks-Davis, 1996, "Filtered Document Retrieval with Frequency-Sorted Indexes", J. Am. Soc. for Information Science, vol. 47, no. 10.
[7] Hemalatha S., Raja K., and Tholkappia Arasu, 2011, "Duplicate Detection of Web Results from Multiple Databases", Computational Science - New Dimensions and Perspectives, NCCSE.
[8] K. Mouratidis, S. Bakiras, and D. Papadias, "Continuous Monitoring of Top-k Queries over Sliding Windows".
[9] K. Mouratidis and H. Pang, October 2011, "Efficient Evaluation of Continuous Text Search Queries", IEEE Transactions on Knowledge and Data Engineering, vol. 20.
[10] C. He, Y. Lu, and D. Swanson, "Matchmaking: A New MapReduce Scheduling Technique", University of Nebraska-Lincoln, Lincoln, US.

Authors
Srivatsan Sridharan. Srivatsan is a young academician who is highly passionate about learning and adapting to newer technologies. He completed his Bachelor of Engineering at Sri SaiRam Engineering College, affiliated to Anna University, Chennai (ranked 26th among all the affiliated institutions in the Dept. of Computer Science), and is currently pursuing his M.Tech at IIIT - Bangalore. He has been awarded the Best Paper Award three times: at ICCCAN 2013, IEEE ICRDPET 2013, and ICEAT 2012. He has also applied for four patents at the Indian Patent Office located at Chennai, Tamil Nadu, India: two with Kausal Malladi, one with Yamini Muralitharan, and one individually.
Kausal Malladi. A B.Tech. graduate in Information Technology, a results-driven engineer with an inclination towards open source, and a visionary entrepreneur, currently pursuing a Masters in Computer Science at the International Institute of Information Technology, Bangalore (IIIT-B). Passionate about technology, he looks forward to solving some of the existing problems in the arena of computers by extending his expertise in Operating Systems, Data Management, and Theoretical Computer Science. He has applied for two patents jointly with Srivatsan Sridharan.
Yamini Muralitharan. Yamini is a young academician who is passionate about research in Software Engineering and Testing. She completed her Bachelors at the College of Engineering, Guindy, Anna University, Tamil Nadu. She is currently pursuing her Masters at IIIT - Bangalore. She has also applied for one patent jointly with Srivatsan Sridharan at the Patent Office, Chennai.

185

More Related Content

What's hot

Fast Range Aggregate Queries for Big Data Analysis
Fast Range Aggregate Queries for Big Data AnalysisFast Range Aggregate Queries for Big Data Analysis
Fast Range Aggregate Queries for Big Data AnalysisIRJET Journal
 
USING ONTOLOGIES TO IMPROVE DOCUMENT CLASSIFICATION WITH TRANSDUCTIVE SUPPORT...
USING ONTOLOGIES TO IMPROVE DOCUMENT CLASSIFICATION WITH TRANSDUCTIVE SUPPORT...USING ONTOLOGIES TO IMPROVE DOCUMENT CLASSIFICATION WITH TRANSDUCTIVE SUPPORT...
USING ONTOLOGIES TO IMPROVE DOCUMENT CLASSIFICATION WITH TRANSDUCTIVE SUPPORT...IJDKP
 
Data characterization towards modeling frequent pattern mining algorithms
Data characterization towards modeling frequent pattern mining algorithmsData characterization towards modeling frequent pattern mining algorithms
Data characterization towards modeling frequent pattern mining algorithmscsandit
 
Parallel and Distributed Algorithms for Large Text Datasets Analysis
Parallel and Distributed Algorithms for Large Text Datasets AnalysisParallel and Distributed Algorithms for Large Text Datasets Analysis
Parallel and Distributed Algorithms for Large Text Datasets AnalysisIllia Ovchynnikov
 
Enhancing the labelling technique of
Enhancing the labelling technique ofEnhancing the labelling technique of
Enhancing the labelling technique ofIJDKP
 
An effective classification approach for big data with parallel generalized H...
An effective classification approach for big data with parallel generalized H...An effective classification approach for big data with parallel generalized H...
An effective classification approach for big data with parallel generalized H...riyaniaes
 
IRJET- Empower Syntactic Exploration Based on Conceptual Graph using Searchab...
IRJET- Empower Syntactic Exploration Based on Conceptual Graph using Searchab...IRJET- Empower Syntactic Exploration Based on Conceptual Graph using Searchab...
IRJET- Empower Syntactic Exploration Based on Conceptual Graph using Searchab...IRJET Journal
 
Method for conducting a combined analysis of grid environment’s fta and gwa t...
Method for conducting a combined analysis of grid environment’s fta and gwa t...Method for conducting a combined analysis of grid environment’s fta and gwa t...
Method for conducting a combined analysis of grid environment’s fta and gwa t...ijgca
 
New proximity estimate for incremental update of non uniformly distributed cl...
New proximity estimate for incremental update of non uniformly distributed cl...New proximity estimate for incremental update of non uniformly distributed cl...
New proximity estimate for incremental update of non uniformly distributed cl...IJDKP
 
IRJET- Diverse Approaches for Document Clustering in Product Development Anal...
IRJET- Diverse Approaches for Document Clustering in Product Development Anal...IRJET- Diverse Approaches for Document Clustering in Product Development Anal...
IRJET- Diverse Approaches for Document Clustering in Product Development Anal...IRJET Journal
 
Data Partitioning in Mongo DB with Cloud
Data Partitioning in Mongo DB with CloudData Partitioning in Mongo DB with Cloud
Data Partitioning in Mongo DB with CloudIJAAS Team
 
GCUBE INDEXING
GCUBE INDEXINGGCUBE INDEXING
GCUBE INDEXINGIJDKP
 
RESEARCH ON DISTRIBUTED SOFTWARE TESTING PLATFORM BASED ON CLOUD RESOURCE
RESEARCH ON DISTRIBUTED SOFTWARE TESTING  PLATFORM BASED ON CLOUD RESOURCERESEARCH ON DISTRIBUTED SOFTWARE TESTING  PLATFORM BASED ON CLOUD RESOURCE
RESEARCH ON DISTRIBUTED SOFTWARE TESTING PLATFORM BASED ON CLOUD RESOURCEijcses
 
WEB LOG PREPROCESSING BASED ON PARTIAL ANCESTRAL GRAPH TECHNIQUE FOR SESSION ...
WEB LOG PREPROCESSING BASED ON PARTIAL ANCESTRAL GRAPH TECHNIQUE FOR SESSION ...WEB LOG PREPROCESSING BASED ON PARTIAL ANCESTRAL GRAPH TECHNIQUE FOR SESSION ...
WEB LOG PREPROCESSING BASED ON PARTIAL ANCESTRAL GRAPH TECHNIQUE FOR SESSION ...cscpconf
 
Big Data on Implementation of Many to Many Clustering
Big Data on Implementation of Many to Many ClusteringBig Data on Implementation of Many to Many Clustering
Big Data on Implementation of Many to Many Clusteringpaperpublications3
 

What's hot (17)

Fast Range Aggregate Queries for Big Data Analysis
Fast Range Aggregate Queries for Big Data AnalysisFast Range Aggregate Queries for Big Data Analysis
Fast Range Aggregate Queries for Big Data Analysis
 
USING ONTOLOGIES TO IMPROVE DOCUMENT CLASSIFICATION WITH TRANSDUCTIVE SUPPORT...
USING ONTOLOGIES TO IMPROVE DOCUMENT CLASSIFICATION WITH TRANSDUCTIVE SUPPORT...USING ONTOLOGIES TO IMPROVE DOCUMENT CLASSIFICATION WITH TRANSDUCTIVE SUPPORT...
USING ONTOLOGIES TO IMPROVE DOCUMENT CLASSIFICATION WITH TRANSDUCTIVE SUPPORT...
 
Data characterization towards modeling frequent pattern mining algorithms
Data characterization towards modeling frequent pattern mining algorithmsData characterization towards modeling frequent pattern mining algorithms
Data characterization towards modeling frequent pattern mining algorithms
 
Parallel and Distributed Algorithms for Large Text Datasets Analysis
Parallel and Distributed Algorithms for Large Text Datasets AnalysisParallel and Distributed Algorithms for Large Text Datasets Analysis
Parallel and Distributed Algorithms for Large Text Datasets Analysis
 
Enhancing the labelling technique of
Enhancing the labelling technique ofEnhancing the labelling technique of
Enhancing the labelling technique of
 
An effective classification approach for big data with parallel generalized H...
An effective classification approach for big data with parallel generalized H...An effective classification approach for big data with parallel generalized H...
An effective classification approach for big data with parallel generalized H...
 
Poster (1)
Poster (1)Poster (1)
Poster (1)
 
IRJET- Empower Syntactic Exploration Based on Conceptual Graph using Searchab...
IRJET- Empower Syntactic Exploration Based on Conceptual Graph using Searchab...IRJET- Empower Syntactic Exploration Based on Conceptual Graph using Searchab...
IRJET- Empower Syntactic Exploration Based on Conceptual Graph using Searchab...
 
Method for conducting a combined analysis of grid environment’s fta and gwa t...
Method for conducting a combined analysis of grid environment’s fta and gwa t...Method for conducting a combined analysis of grid environment’s fta and gwa t...
Method for conducting a combined analysis of grid environment’s fta and gwa t...
 
New proximity estimate for incremental update of non uniformly distributed cl...
New proximity estimate for incremental update of non uniformly distributed cl...New proximity estimate for incremental update of non uniformly distributed cl...
New proximity estimate for incremental update of non uniformly distributed cl...
 
IRJET- Diverse Approaches for Document Clustering in Product Development Anal...
IRJET- Diverse Approaches for Document Clustering in Product Development Anal...IRJET- Diverse Approaches for Document Clustering in Product Development Anal...
IRJET- Diverse Approaches for Document Clustering in Product Development Anal...
 
G1803054653
G1803054653G1803054653
G1803054653
 
Data Partitioning in Mongo DB with Cloud
Data Partitioning in Mongo DB with CloudData Partitioning in Mongo DB with Cloud
Data Partitioning in Mongo DB with Cloud
 
GCUBE INDEXING
GCUBE INDEXINGGCUBE INDEXING
GCUBE INDEXING
 
RESEARCH ON DISTRIBUTED SOFTWARE TESTING PLATFORM BASED ON CLOUD RESOURCE
RESEARCH ON DISTRIBUTED SOFTWARE TESTING  PLATFORM BASED ON CLOUD RESOURCERESEARCH ON DISTRIBUTED SOFTWARE TESTING  PLATFORM BASED ON CLOUD RESOURCE
RESEARCH ON DISTRIBUTED SOFTWARE TESTING PLATFORM BASED ON CLOUD RESOURCE
 
WEB LOG PREPROCESSING BASED ON PARTIAL ANCESTRAL GRAPH TECHNIQUE FOR SESSION ...
WEB LOG PREPROCESSING BASED ON PARTIAL ANCESTRAL GRAPH TECHNIQUE FOR SESSION ...WEB LOG PREPROCESSING BASED ON PARTIAL ANCESTRAL GRAPH TECHNIQUE FOR SESSION ...
WEB LOG PREPROCESSING BASED ON PARTIAL ANCESTRAL GRAPH TECHNIQUE FOR SESSION ...
 
Big Data on Implementation of Many to Many Clustering
Big Data on Implementation of Many to Many ClusteringBig Data on Implementation of Many to Many Clustering
Big Data on Implementation of Many to Many Clustering
 

Viewers also liked

Μεγάλοι επιστήμονες
Μεγάλοι επιστήμονεςΜεγάλοι επιστήμονες
Μεγάλοι επιστήμονεςmariatsilopoulou
 
11-11-13 Expansión. David Navarro: "Las redes sociales cotizadas, muy poco pa...
11-11-13 Expansión. David Navarro: "Las redes sociales cotizadas, muy poco pa...11-11-13 Expansión. David Navarro: "Las redes sociales cotizadas, muy poco pa...
11-11-13 Expansión. David Navarro: "Las redes sociales cotizadas, muy poco pa...Inversis Banco
 
Trabalho elen e yasmin
Trabalho elen e yasminTrabalho elen e yasmin
Trabalho elen e yasminelenpinheiro
 
No museu da cidade de aveiro acontece novembro
No museu da cidade de aveiro acontece novembroNo museu da cidade de aveiro acontece novembro
No museu da cidade de aveiro acontece novembroMuseu Cidade
 
Documentary analysis
Documentary analysisDocumentary analysis
Documentary analysisclobebobe
 
SSE MGM Program Live Case Overview Nov 2013
SSE MGM Program Live Case Overview Nov 2013SSE MGM Program Live Case Overview Nov 2013
SSE MGM Program Live Case Overview Nov 2013Robin Teigland
 
Stb izm2 912-98-bylki-proekt
Stb izm2 912-98-bylki-proektStb izm2 912-98-bylki-proekt
Stb izm2 912-98-bylki-proektQuinn Kane
 
приглашение на 180
приглашение на 180приглашение на 180
приглашение на 180lyudmilaAlexeeva
 

Viewers also liked (17)

Μεγάλοι επιστήμονες
Μεγάλοι επιστήμονεςΜεγάλοι επιστήμονες
Μεγάλοι επιστήμονες
 
Communication skills
Communication skillsCommunication skills
Communication skills
 
Uci sin paredes gertech
Uci sin paredes gertechUci sin paredes gertech
Uci sin paredes gertech
 
Doc analysis 1
Doc analysis 1Doc analysis 1
Doc analysis 1
 
Polizia
PoliziaPolizia
Polizia
 
株の小学校
株の小学校株の小学校
株の小学校
 
11-11-13 Expansión. David Navarro: "Las redes sociales cotizadas, muy poco pa...
11-11-13 Expansión. David Navarro: "Las redes sociales cotizadas, muy poco pa...11-11-13 Expansión. David Navarro: "Las redes sociales cotizadas, muy poco pa...
11-11-13 Expansión. David Navarro: "Las redes sociales cotizadas, muy poco pa...
 
Trabalho elen e yasmin
Trabalho elen e yasminTrabalho elen e yasmin
Trabalho elen e yasmin
 
Stb urov
Stb urovStb urov
Stb urov
 
No museu da cidade de aveiro acontece novembro
No museu da cidade de aveiro acontece novembroNo museu da cidade de aveiro acontece novembro
No museu da cidade de aveiro acontece novembro
 
Documentary analysis
Documentary analysisDocumentary analysis
Documentary analysis
 
SSE MGM Program Live Case Overview Nov 2013
SSE MGM Program Live Case Overview Nov 2013SSE MGM Program Live Case Overview Nov 2013
SSE MGM Program Live Case Overview Nov 2013
 
Stb izm2 912-98-bylki-proekt
Stb izm2 912-98-bylki-proektStb izm2 912-98-bylki-proekt
Stb izm2 912-98-bylki-proekt
 
приглашение на 180
приглашение на 180приглашение на 180
приглашение на 180
 
Ts t 2510
Ts t 2510Ts t 2510
Ts t 2510
 
Gamechangers presentation
Gamechangers presentationGamechangers presentation
Gamechangers presentation
 
Tkp e ecol
Tkp e ecolTkp e ecol
Tkp e ecol
 

Similar to Data mining model for the data retrieval from central server configuration

RELEVANT UPDATED DATA RETRIEVAL ARCHITECTURAL MODEL FOR CONTINUOUS TEXT EXTRA...
RELEVANT UPDATED DATA RETRIEVAL ARCHITECTURAL MODEL FOR CONTINUOUS TEXT EXTRA...RELEVANT UPDATED DATA RETRIEVAL ARCHITECTURAL MODEL FOR CONTINUOUS TEXT EXTRA...
RELEVANT UPDATED DATA RETRIEVAL ARCHITECTURAL MODEL FOR CONTINUOUS TEXT EXTRA...cscpconf
 
IRJET- Towards Efficient Framework for Semantic Query Search Engine in Large-...
IRJET- Towards Efficient Framework for Semantic Query Search Engine in Large-...IRJET- Towards Efficient Framework for Semantic Query Search Engine in Large-...
IRJET- Towards Efficient Framework for Semantic Query Search Engine in Large-...IRJET Journal
 
IRJET- Big Data Processes and Analysis using Hadoop Framework
IRJET- Big Data Processes and Analysis using Hadoop FrameworkIRJET- Big Data Processes and Analysis using Hadoop Framework
IRJET- Big Data Processes and Analysis using Hadoop FrameworkIRJET Journal
 
MongoDB What's new in 3.2 version
MongoDB What's new in 3.2 versionMongoDB What's new in 3.2 version
MongoDB What's new in 3.2 versionHéliot PERROQUIN
 
Paper id 25201463
Paper id 25201463Paper id 25201463
Paper id 25201463IJRAT
 
Map reduce advantages over parallel databases report
Map reduce advantages over parallel databases reportMap reduce advantages over parallel databases report
Map reduce advantages over parallel databases reportAhmad El Tawil
 
IRJET - Text Summarizer.
IRJET -  	  Text Summarizer.IRJET -  	  Text Summarizer.
IRJET - Text Summarizer.IRJET Journal
 
Anomalous symmetry succession for seek out
Anomalous symmetry succession for seek outAnomalous symmetry succession for seek out
Anomalous symmetry succession for seek outiaemedu
 
Log Analysis Engine with Integration of Hadoop and Spark
Log Analysis Engine with Integration of Hadoop and SparkLog Analysis Engine with Integration of Hadoop and Spark
Log Analysis Engine with Integration of Hadoop and SparkIRJET Journal
 
Database Engine Control though Web Portal Monitoring Configuration
Database Engine Control though Web Portal Monitoring ConfigurationDatabase Engine Control though Web Portal Monitoring Configuration
Database Engine Control though Web Portal Monitoring ConfigurationIRJET Journal
 
Vision Based Deep Web data Extraction on Nested Query Result Records
Vision Based Deep Web data Extraction on Nested Query Result RecordsVision Based Deep Web data Extraction on Nested Query Result Records
Vision Based Deep Web data Extraction on Nested Query Result RecordsIJMER
 
Scaling Application on High Performance Computing Clusters and Analysis of th...
Scaling Application on High Performance Computing Clusters and Analysis of th...Scaling Application on High Performance Computing Clusters and Analysis of th...
Scaling Application on High Performance Computing Clusters and Analysis of th...Rusif Eyvazli
 
Ajith_kumar_4.3 Years_Informatica_ETL
Ajith_kumar_4.3 Years_Informatica_ETLAjith_kumar_4.3 Years_Informatica_ETL
Ajith_kumar_4.3 Years_Informatica_ETLAjith Kumar Pampatti
 
PRESS MANAGEMENT Documentation
PRESS MANAGEMENT DocumentationPRESS MANAGEMENT Documentation
PRESS MANAGEMENT Documentationanuj_rakheja
 
APPLICATION DEVELOPMENT TO CONVERT HETEROGENEOUS INFORMATION INTO PQDIF (POWE...
APPLICATION DEVELOPMENT TO CONVERT HETEROGENEOUS INFORMATION INTO PQDIF (POWE...APPLICATION DEVELOPMENT TO CONVERT HETEROGENEOUS INFORMATION INTO PQDIF (POWE...
APPLICATION DEVELOPMENT TO CONVERT HETEROGENEOUS INFORMATION INTO PQDIF (POWE...ijcsit
 

Similar to Data mining model for the data retrieval from central server configuration (20)

RELEVANT UPDATED DATA RETRIEVAL ARCHITECTURAL MODEL FOR CONTINUOUS TEXT EXTRA...
RELEVANT UPDATED DATA RETRIEVAL ARCHITECTURAL MODEL FOR CONTINUOUS TEXT EXTRA...RELEVANT UPDATED DATA RETRIEVAL ARCHITECTURAL MODEL FOR CONTINUOUS TEXT EXTRA...
RELEVANT UPDATED DATA RETRIEVAL ARCHITECTURAL MODEL FOR CONTINUOUS TEXT EXTRA...
 
IRJET- Towards Efficient Framework for Semantic Query Search Engine in Large-...
IRJET- Towards Efficient Framework for Semantic Query Search Engine in Large-...IRJET- Towards Efficient Framework for Semantic Query Search Engine in Large-...
IRJET- Towards Efficient Framework for Semantic Query Search Engine in Large-...
 
IRJET- Big Data Processes and Analysis using Hadoop Framework
IRJET- Big Data Processes and Analysis using Hadoop FrameworkIRJET- Big Data Processes and Analysis using Hadoop Framework
IRJET- Big Data Processes and Analysis using Hadoop Framework
 
MongoDB What's new in 3.2 version
MongoDB What's new in 3.2 versionMongoDB What's new in 3.2 version
MongoDB What's new in 3.2 version
 
Paper id 25201463
Paper id 25201463Paper id 25201463
Paper id 25201463
 
Map reduce advantages over parallel databases report
Map reduce advantages over parallel databases reportMap reduce advantages over parallel databases report
Map reduce advantages over parallel databases report
 
IRJET - Text Summarizer.
IRJET -  	  Text Summarizer.IRJET -  	  Text Summarizer.
IRJET - Text Summarizer.
 
Anomalous symmetry succession for seek out
Anomalous symmetry succession for seek outAnomalous symmetry succession for seek out
Anomalous symmetry succession for seek out
 
[IJCT-V3I2P32] Authors: Amarbir Singh, Palwinder Singh
[IJCT-V3I2P32] Authors: Amarbir Singh, Palwinder Singh[IJCT-V3I2P32] Authors: Amarbir Singh, Palwinder Singh
[IJCT-V3I2P32] Authors: Amarbir Singh, Palwinder Singh
 
Log Analysis Engine with Integration of Hadoop and Spark
Log Analysis Engine with Integration of Hadoop and SparkLog Analysis Engine with Integration of Hadoop and Spark
Log Analysis Engine with Integration of Hadoop and Spark
 
GCF
GCFGCF
GCF
 
Database Engine Control though Web Portal Monitoring Configuration
Database Engine Control though Web Portal Monitoring ConfigurationDatabase Engine Control though Web Portal Monitoring Configuration
Database Engine Control though Web Portal Monitoring Configuration
 
Job portal
Job portalJob portal
Job portal
 
Vision Based Deep Web data Extraction on Nested Query Result Records
Vision Based Deep Web data Extraction on Nested Query Result RecordsVision Based Deep Web data Extraction on Nested Query Result Records
Vision Based Deep Web data Extraction on Nested Query Result Records
 
Scaling Application on High Performance Computing Clusters and Analysis of th...
Scaling Application on High Performance Computing Clusters and Analysis of th...Scaling Application on High Performance Computing Clusters and Analysis of th...
Scaling Application on High Performance Computing Clusters and Analysis of th...
 
Computers in management
Computers in managementComputers in management
Computers in management
 
Ajith_kumar_4.3 Years_Informatica_ETL
Ajith_kumar_4.3 Years_Informatica_ETLAjith_kumar_4.3 Years_Informatica_ETL
Ajith_kumar_4.3 Years_Informatica_ETL
 
Proposal with sdlc
Proposal with sdlcProposal with sdlc
Proposal with sdlc
 
PRESS MANAGEMENT Documentation
PRESS MANAGEMENT DocumentationPRESS MANAGEMENT Documentation
PRESS MANAGEMENT Documentation
 
APPLICATION DEVELOPMENT TO CONVERT HETEROGENEOUS INFORMATION INTO PQDIF (POWE...
APPLICATION DEVELOPMENT TO CONVERT HETEROGENEOUS INFORMATION INTO PQDIF (POWE...APPLICATION DEVELOPMENT TO CONVERT HETEROGENEOUS INFORMATION INTO PQDIF (POWE...
APPLICATION DEVELOPMENT TO CONVERT HETEROGENEOUS INFORMATION INTO PQDIF (POWE...
 

Recently uploaded

Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
Bluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdfBluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdfngoud9212
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr LapshynFwdays
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptxLBM Solutions
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024The Digital Insurer
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024BookNet Canada
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Wonjun Hwang
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 

Recently uploaded (20)

Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
Bluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdfBluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdf
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptx
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 

Data mining model for the data retrieval from central server configuration

  • 1. International Journal of Computer Science & Information Technology (IJCSIT) Vol 5, No 5, October 2013 Data Mining Model for the Data Retrieval from Central Server Configuration 1 Srivatsan Sridharan , Kausal Malladi and Yamini Muralitharan 2 1 2 Department of Computer Science, IIIT - Bangalore, India. Department of Software Engineering, IIIT - Bangalore, India. ABSTRACT A server, which is to keep track of heavy document traffic, is unable to filter the documents that are most relevant and updated for continuous text search queries. This paper focuses on handling continuous text extraction sustaining high document traffic. The main objective is to retrieve recent updated documents that are most relevant to the query by applying sliding window technique. Our solution indexes the streamed documents in the main memory with structure based on the principles of inverted file, and processes document arrival and expiration events with incremental threshold-based method. It also ensures elimination of duplicate document retrieval using unsupervised duplicate detection. The documents are ranked based on user feedback and given higher priority for retrieval. KEYWORDS Continuous Text Queries, MapReduce Technique, Sliding Window. 1. INTRODUCTION Data intensive applications such as electronic mail, news feed, telecommunication management, automation of business reporting etc raise the need for a continuous text search and monitoring model. In this model the documents arrive at the monitoring server as in the form of a stream. Each query Q continuously retrieves, from a sliding window of the most recent documents, the k that is most similar to a fixed set of search terms. Sliding window. This window reflects the interest of the users in the newest available documents. It can be defined in two alternative ways. They are a) count-based window contains the N most recent documents for some constant number N, b) timebased window contains only documents that arrived within last N time units. Thus, although a document which may be relevant to a query, it is ignored, because it may not satisfy the time and count constraints of the user. Incremental threshold method. The quintessence of the algorithm is to employ threshold-based techniques to derive the initial result for a query, and then continue to update the threshold to reflect document arrivals and expirations. At its core lies a memory-based index similar to the conventional inverted file, complimented with fast updated techniques. MapReduce technique. MapReduce is a powerful platform for large scale data processing. This technique involves two steps namely a) map step: The master node takes the input, partitions it up into smaller sub-problems, and distributes them to worker nodes. A worker node may do this again in turn, leading to a multi-level structure. The worker node processes the smaller problem, and passes the answer back to its master node, b) reduce step: The master node then collects the answers to all the sub-problems and combines them in some way to form the output – the answer to the problem it was originally trying to solve. Unsupervised duplicate detection. [3] The problem of identifying objects in databases that refer to the same real world entity, is known, among others, as duplicate detection or record linkage. Here this method is used to identify documents that are all DOI : 10.5121/ijcsit.2013.5514 177
  • 2. International Journal of Computer Science & Information Technology (IJCSIT) Vol 5, No 5, October 2013 alike and prevent them from being prepared in the result set. Our paper also focuses on ranking the documents based on user feedback. The user is allowed to give feedback for each document that has been retrieved. This feedback is used to rank the document and hence increase the probability of the document to appear in the sliding window. Visual Web Spider is a fully automated, multi-threaded web crawler that allows us to index and collect specific web pages on the Internet. Once installed, it enables us to browse the Web in an automated manner, indexing pages that contain specific keywords and phrases and exporting the indexed data to a database on our local computer in the format of our choice. We want to collect website links to build our own specialized web directory. We can configure Visual Web Spider automatically. This program’s friendly, wizard-driven interface lets us customize our search in a step-by-step manner. To index relevant web pages, just follow this simple sequence of steps. After opening the wizard, enter the starting web page URL or let the program generate URL links based on specific keywords or phrases. Then set the crawling rules and depth according to your search strategy. Finally, specify the data you want to index and your project filename. That’s pretty much it. Clicking on ‘Start’ sets the crawler to work. Crawling is fast, thanks to multithreading that allows up to 50 simultaneous threads. Another nice touch is that Visual Web Spider can index the content of any HTML tag such as: page title (TITLE tag), page text (BODY tag), HTML code (HTML tag), header text (H1-H6 tags), bold text (B tags), anchor text (A tags), alt text (IMG tag, ALT attribute), keywords, description (META tags) and others. This program can also list each page size and last modified date. Once the web pages have been indexed, Visual Web Spider can export the indexed data to any of the following formats: Microsoft Access, Excel (CSV), TXT, HTML, and MySQL script. 1.1.Key Features A Personal, Customizable Web crawler. Crawling rules. Multi-threaded technology (up to 50 threads). Support for the robots exclusion protocol/standard (Robots.txt file and Robots META tags);Index the contents of any HTML tag. Indexing rules; Export the indexed data into Microsoft Access database, TEXT file, Excel file (CSV), HTML file, MySQL script file; Start crawling from a list of the URLs specified by user; Start crawling using keywords and phrases; Store web pages and media files on your local disk; Auto-resolve URL of redirected links; Detect broken links; Filter the indexed data; 2. EXISTING SYSTEM Drawbacks of the existing servers that tend to handle the heavy document traffic are: Cannot efficiently monitor the data stream that has highly dynamic document traffic. The server alone does the processing hence it involves large amount of time consumption. In case of continuous text search queries and extraction every time the entire document set has to be scanned in order to find the relevant documents. There is no confirmation that duplicate documents are not retrieved for the given query. A large amount of documents cannot be stored in the main memory as it involves large amount of CPU cost. 
Naïve solution: The most straightforward approach to evaluating the continuous queries defined above is to scan the entire window contents D after every update, or at fixed time intervals, compute all the document scores, and report the top-k documents. This method incurs high processing costs because it recomputes the result from scratch each time.
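To make the cost of this baseline concrete, the following sketch (Python; the function names and the document representation are ours, not the paper's) recomputes every score over the whole window for a single query each time it is called.

import heapq
from typing import Dict, List, Tuple

def similarity(composition: Dict[str, float], query: Dict[str, float]) -> float:
    """Accumulate w_dt * w_Qt over the terms shared by the document and the query."""
    return sum(w_dt * query[t] for t, w_dt in composition.items() if t in query)

def naive_top_k(window: List[Tuple[str, Dict[str, float]]],
                query: Dict[str, float], k: int) -> List[Tuple[float, str]]:
    """Scan the entire window D and recompute every document score from scratch."""
    scores = [(similarity(comp, query), doc_id) for doc_id, comp in window]
    return heapq.nlargest(k, scores)

# This has to be re-run after every document arrival/expiration (or at fixed
# intervals), which is exactly the source of the high processing cost.
window = [("d1", {"red": 0.6, "rose": 0.3}), ("d2", {"red": 0.1, "blue": 0.8})]
print(naive_top_k(window, {"red": 1.0, "rose": 1.0}, k=2))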
3. PROPOSED SYSTEM

3.1. Problem Formulation
In our model, a stream of documents flows into a central server. The user registers text queries at the server, which is then responsible for continuously monitoring and reporting their results. As in most stream processing systems, we store all the data in main memory in order to cope with frequent updates, and we design our methods with the primary goal of minimizing the CPU cost. It is also necessary to reduce the workload of the monitoring server.

3.2. Proposed Solution
Our solution uses the MapReduce technique to reduce the workload of the central server: the server acts as the master node and splits the processing task across several worker nodes. The number of worker nodes assigned to the task depends on the nature of the query posed by the user. Upon receiving a query, the master node assigns workers to compute the relevant partial result sets and return them to the master node. After receiving the partial solutions from the workers, the master integrates them to produce the final result set for the query; this is shown schematically in Fig. 1. Each worker node uses the incremental threshold algorithm to compute the set of the k most relevant and recent documents for the query. The overall system architecture is shown in Fig. 2.

Figure 2. System architecture for the proposed data retrieval model.
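As an illustration of this master/worker split (not the paper's implementation; the names and the in-process "workers" are our own simplification), a map step can compute a partial top-k per document partition and a reduce step can merge the partial lists at the master.

import heapq
from typing import Dict, List, Tuple

Document = Tuple[str, Dict[str, float]]          # (doc_id, {term: weight})

def map_worker(partition: List[Document],
               query: Dict[str, float], k: int) -> List[Tuple[float, str]]:
    """Worker: score only its own partition and return its local top-k."""
    scored = [(sum(w * query.get(t, 0.0) for t, w in comp.items()), doc_id)
              for doc_id, comp in partition]
    return heapq.nlargest(k, scored)

def reduce_master(partials: List[List[Tuple[float, str]]], k: int) -> List[Tuple[float, str]]:
    """Master: merge the workers' partial top-k lists into the final result set."""
    return heapq.nlargest(k, (entry for partial in partials for entry in partial))

# In the real system the master ships the query to remote workers; here the
# "workers" are plain function calls over local partitions of the stream.
partitions = [[("d1", {"red": 0.6})], [("d2", {"red": 0.9, "rose": 0.2})]]
query = {"red": 1.0, "rose": 0.5}
print(reduce_master([map_worker(p, query, k=2) for p in partitions], k=2))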
Fig. 1. A data retrieval system using MapReduce.

Each element of the input stream comprises a document d, a unique document identifier, the document arrival time, and a composition list. The composition list contains one (t, w_dt) pair for each term t belonging to T that appears in the document, where w_dt is the frequency of term t in document d. The notation used in this model is summarized in Fig. 3.

Figure 3. A detailed list of the notations.

The worker node maintains an inverted index with one inverted list per term t. With the inverted index, a query Q is processed as follows: the inverted lists of the terms t belonging to Q are scanned, and the partial w_dt scores of each encountered document d are accumulated to produce S(d|Q). The documents with the highest scores at the end are returned as the result.
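A compact sketch of such a per-worker index is given below (Python; the class name, the method names, and the re-sorting shortcut are ours; the paper keeps each list ordered with logarithmic-time insertions). It stores one impact entry per (term, document) pair and accumulates S(d|Q) exactly as described above.

from collections import defaultdict
from typing import Dict, List, Tuple

class InvertedIndex:
    """One inverted list L_t of (doc_id, w_dt) impact entries per term t."""

    def __init__(self) -> None:
        self.lists: Dict[str, List[Tuple[str, float]]] = defaultdict(list)

    def add_document(self, doc_id: str, composition: Dict[str, float]) -> None:
        # One impact entry per term of the arriving document; re-sorting by w_dt
        # keeps the sketch short (the paper uses logarithmic insertion instead).
        for term, w_dt in composition.items():
            self.lists[term].append((doc_id, w_dt))
            self.lists[term].sort(key=lambda entry: entry[1], reverse=True)

    def remove_document(self, doc_id: str, composition: Dict[str, float]) -> None:
        # Drop the expiring document's impact entries from the affected lists.
        for term in composition:
            self.lists[term] = [e for e in self.lists[term] if e[0] != doc_id]

    def score(self, query: Dict[str, float], k: int) -> List[Tuple[float, str]]:
        # Accumulate the partial w_Qt * w_dt contributions per document: S(d|Q).
        acc: Dict[str, float] = defaultdict(float)
        for term, w_qt in query.items():
            for doc_id, w_dt in self.lists.get(term, []):
                acc[doc_id] += w_qt * w_dt
        return sorted(((s, d) for d, s in acc.items()), reverse=True)[:k]

index = InvertedIndex()
index.add_document("d6", {"red": 0.8, "rose": 0.2})
index.add_document("d7", {"rose": 0.9})
print(index.score({"red": 1.0, "rose": 1.0}, k=2))   # d6 scores 1.0, d7 scores 0.9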
3.3. Incremental Threshold Algorithm
Fig. 4 shows the data structures used in this system. The valid documents D are stored in a single list, shown at the bottom of the figure. Each element of the list holds a document's stream information (identifier, text content, composition list, arrival time). D contains the most recent documents for both count-based and time-based windows. Since documents expire in first-in-first-out order, D is maintained efficiently by inserting arriving documents at the end of the list and deleting expiring ones from its head.

On top of the list of valid documents we build an inverted index. The structure at the top of the figure is the dictionary of search terms: an array with one entry for each term t belonging to T. The dictionary entry for t stores a pointer to the corresponding inverted list Lt. Lt holds an impact entry for each document d that contains t, together with a pointer to d's full information in the document list. When a document d arrives, an impact entry (d, w_dt), derived from d's composition list, is inserted into the inverted list of each term t that appears in d. Likewise, the impact entries of an expiring document are removed from the respective inverted lists. The inverted lists are kept sorted on w_dt in a structure that supports fast (logarithmic) insertions and deletions.

Initial Top-k Search: When a query is first submitted to the system, its top-k result is computed using the initial search module. The process is an adaptation of the threshold algorithm, with the inverted lists Lt of the query terms playing the role of the sorted attribute lists. Unlike the original threshold algorithm, however, we do not probe the lists in a round-robin fashion. Since the similarity function associates a different weight w_Qt with each query term, inspired by [4] we probe the list Lt with the highest value c_t = w_Qt · w_dnxt,t, where dnxt is the next document in Lt. The global threshold, a notion used identically to the original algorithm, is the sum of the c_t values over all the terms in Q.

Consider query Q1 with search string "red rose" and k = 2. Let t20 = "red" and t11 = "rose". First the server identifies the inverted lists L11 and L20 (using the dictionary hash table) and computes the values c_11 = w_Q1,t11 · w_d7,t11 and c_20 = w_Q1,t20 · w_d6,t20. In iteration 1, since c_20 is larger, the first entry of L20 is popped; the similarity score of the corresponding document, d6, is computed by accessing its composition list in D, and d6 is inserted into the tentative result R. c_20 is then updated to the impact value of the next entry in L20, and the iterations continue. An entry whose impact is above the local threshold but whose score has not yet been verified is still kept in R as an unverified entry. The algorithm is as follows.

Algorithm: Incremental Threshold with Duplicate Detection (arriving dins, expiring ddel)
1:  Insert document dins into D (the system document list)
2:  for all terms t in the composition list of dins do
3:      for all documents in Lt do
4:          for all terms t in dins do
5:              Compute unique(dins)            // unsupervised duplicate check
6:              w_dins,t != w_dnxt,t
7:      Insert the impact entry of dins into Lt
8:      Probe the threshold tree of Lt
9:      for all queries Q where w_dins,t >= localThreshold do
10:         if Q has not been considered for dins in another Lt then
11:             Compute S(dins|Q)
12:             Insert dins into R
13:             if S(dins|Q) >= old Sk then
14:                 Update Sk (since dins enters the top-k result)
15:                 Keep rolling up local thresholds while they are <= Sk
16:                 Set the new τ as the influence threshold for Q
17:             Update the local thresholds of Q
18: Delete document ddel from D (the system document list)
19: for all terms t in the composition list of ddel do
20:     Delete the impact entry of ddel from Lt
21:     Probe the threshold tree of Lt
22:     for all queries Q where w_ddel,t >= localThreshold do
23:         if Q has not been considered for ddel in another Lt then
24:             Delete ddel from R
25:             if S(ddel|Q) >= old Sk then
26:                 Resume the top-k search from the local thresholds
27:                 Set the new τ as the influence threshold for Q
28:             Update the local thresholds of Q

After constructing the initial result set R with the above algorithm, only the documents whose score is higher than or equal to the influence threshold τ are verified. A key point is that no duplicate documents become part of the result set R; this is ensured using unsupervised duplicate detection. The idea of unsupervised learning for duplicate detection has its roots in the probabilistic model proposed by Fellegi and Sunter. When there is no training data to compute the probability estimates, variations of the Expectation-Maximization algorithm can be used to identify appropriate clusters in the data.

Figure 4. Data structures used for the Incremental Threshold Algorithm.
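The arrival/expiration side of this listing can be sketched as follows (Python; a simplified, single-query version in which the names, the Jaccard-based duplicate test standing in for the EM-based detection, and the full rescoring in place of the paper's incremental threshold update are all our own assumptions).

from collections import defaultdict
from typing import Dict, List, Optional, Tuple

Doc = Tuple[str, Dict[str, float]]            # (doc_id, composition list {term: w_dt})

def jaccard(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Crude stand-in for the paper's unsupervised (EM-based) duplicate detection."""
    ta, tb = set(a), set(b)
    return len(ta & tb) / len(ta | tb) if (ta or tb) else 0.0

def handle_events(d_ins: Doc, d_del: Doc,
                  window: List[Doc],
                  inverted: Dict[str, Dict[str, float]],     # term -> {doc_id: w_dt}
                  query: Dict[str, float], k: int,
                  influence_threshold: float,
                  dup_cutoff: float = 0.9) -> Optional[List[Tuple[float, str]]]:
    """Process one arrival and one expiration for a single registered query.
    Returns the new top-k list, or None if the cached result is still valid."""
    ins_id, ins_comp = d_ins
    del_id, del_comp = d_del

    # 1. Duplicate check: ignore the arrival if it is a near-copy of a stored document.
    is_duplicate = any(jaccard(ins_comp, comp) >= dup_cutoff for _, comp in window)

    # 2. Slide the window FIFO-style and keep the impact entries (inverted lists) in sync.
    window[:] = [(i, c) for i, c in window if i != del_id]
    for term in del_comp:
        if term in inverted:
            inverted[term].pop(del_id, None)
    if not is_duplicate:
        window.append(d_ins)
        for term, w_dt in ins_comp.items():
            inverted.setdefault(term, {})[ins_id] = w_dt

    # 3. Re-evaluate the query only if one of the events can influence its result,
    #    i.e. its impact on the query terms reaches the influence threshold τ.
    def impact(comp: Dict[str, float]) -> float:
        return sum(w * query.get(t, 0.0) for t, w in comp.items())

    if impact(del_comp) < influence_threshold and \
       (is_duplicate or impact(ins_comp) < influence_threshold):
        return None                              # cached top-k remains valid

    # For brevity we rescore from the inverted lists; the paper instead resumes
    # the threshold search from the query's local thresholds.
    scores: Dict[str, float] = defaultdict(float)
    for term, w_qt in query.items():
        for doc_id, w_dt in inverted.get(term, {}).items():
            scores[doc_id] += w_qt * w_dt
    return sorted(((s, d) for d, s in scores.items()), reverse=True)[:k]

A caller would keep one cached result list per registered query and replace it whenever handle_events returns a value other than None.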
  • 7. International Journal of Computer Science & Information Technology (IJCSIT) Vol 5, No 5, October 2013 contains 181,978 terms. The documents are streamed into the monitoring system, following a Poisson process with a mean arrival rate of 200 documents/second. We generate 1,000 queries with k = 10 and terms selected randomly from the dictionary. We use a count- based window; the results for a time-based one are similar. First, we investigate the effect of the number of search terms n on the performance of ITA and Na¨ve. In Figure 5(a), we set the window size ı to 1,000 documents, and vary n from 4 to 40. We measure the average processing time, i.e., the elapsed time between the arrival of a new document (which additionally causes the expiration of an existing one) and the point where all the query results are updated accordingly. The measurements are in milliseconds and are plotted in logarithmic scale. With more search terms (i.e., larger n), an arriving/expiring document has a higher chance of sharing common terms with the queries. This leads to an increase in the number of queries that need updating, and thus to a longer processing time. ITA is about 10 times faster than Na¨ve for queries comprising 4 search ı terms, and 6 times faster for 40- term queries. Querylength Processing time for naïve in msec Processing time for ITA in msec 3 0.9 0.07 8 1.5 0.2 13 3 0.4 18 8 0.7 23 13 1.0 Table 1. Data for Query Length and processing Time. Figure. 5a. Comparison of Query Length and processing Time. In Figure 5(b), we study the effect of the sliding window size N . We set the query length to 10 terms, and vary N from 10 to 100,000 documents. A larger window holds more valid documents in the system. For Na¨ve this imposes a higher cost whenever the result needs ı to be recomputed, because it scans the entire D. For ITA, the inverted lists grow longer, leading to higher index update cost and slower arrival/expiration handling. ITA is 13 times faster than Na¨ve for a window size of 10, and 18 times faster when the sliding window ı comprises 10,000 documents. Note that the last measurement for Na¨ve is missing, because for ı N > 10000 the CPU utilization approaches 100% and the system becomes unstable. The above experiments, as well as others omitted due to lack of space, verify the general superiority of ITA over Na¨ve. ı 183
5. CONCLUSION

In this paper, we study the processing of continuous text queries over document streams. These queries define a set of search terms and request continual monitoring of a ranked list of recent documents that are most similar to those terms. The problem arises in a variety of text monitoring applications, e.g., email and news tracking. To the best of our knowledge, this is the first attempt to address this important problem. Currently, our study focuses on plain text documents. A challenging direction for future work is to extend our methodology to documents tagged with metadata and documents with a hyperlink structure, as well as to specialized scoring mechanisms that may apply in those settings.

ACKNOWLEDGEMENTS

The authors would like to thank Prof. R. Parvatham, Assistant Professor, Sri SaiRam Engineering College, for her continuous support in implementing this work.

REFERENCES

[1] B. Babcock, S. Babu, M. Datar, R. Motwani, and J. Widom, 2002, "Models and Issues in Data Stream Systems", PODS '02, pp. 1-16.
[2] J. Zobel and A. Moffat, July 2006, "Inverted Files for Text Search Engines", ACM Computing Surveys, vol. 38, no. 2, pp. 1-55.
[3] V. N. Anh and A. Moffat, 2002, "Impact Transformation: Effective and Efficient Web Retrieval", Int'l ACM SIGIR Conf. on Research and Development in Information Retrieval.
[4] V. N. Anh, O. de Kretser, and A. Moffat, 2001, "Vector-Space Ranking with Effective Early Termination", ACM SIGIR Conf. on Research and Development in Information Retrieval.
[5] Y. Zhang and J. Callan, 2001, "Maximum Likelihood Estimation for Filtering Thresholds", ACM SIGIR Conf. on Research and Development in Information Retrieval.
[6] M. Persin, J. Zobel, and R. Sacks-Davis, 1996, "Filtered Document Retrieval with Frequency-Sorted Indexes", J. Am. Soc. for Information Science, vol. 47, no. 10.
[7] S. Hemalatha, K. Raja, and Tholkappia Arasu, 2011, "Duplicate Detection of Web Results from Multiple Databases", Computational Science - New Dimensions and Perspectives, NCCSE.
[8] K. Mouratidis, S. Bakiras, and D. Papadias, "Continuous Monitoring of Top-k Queries over Sliding Windows".
[9] K. Mouratidis and H. Pang, October 2011, "Efficient Evaluation of Continuous Text Search Queries", IEEE Transactions on Knowledge and Data Engineering, vol. 20.
[10] C. He, Y. Lu, and D. Swanson, "Matchmaking: A New MapReduce Scheduling Technique", University of Nebraska-Lincoln, Lincoln, U.S.

Authors

Srivatsan Sridharan. Srivatsan is a young academician who is highly passionate about learning and adapting to newer technologies. He completed his Bachelor of Engineering at Sri SaiRam Engineering College, affiliated to Anna University, Chennai (ranked 26th among all affiliated institutions in the Department of Computer Science), and is currently pursuing his M.Tech at IIIT - Bangalore. He has received the Best Paper Award three times: at ICCCAN 2013, IEEE - ICRDPET 2013, and ICEAT 2012. He has also applied for four patents at the Indian Patent Office in Chennai, Tamilnadu, India: two with Kausal Malladi, one with Yamini Muralitharan, and one individually.

Kausal Malladi. A B.Tech. graduate in Information Technology, a results-driven engineer with an affinity for open source, and a visionary entrepreneur, he is currently pursuing a Masters in Computer Science at the International Institute of Information Technology, Bangalore (IIIT-B). Passionate about technology, he looks forward to solving some of the existing problems in computing by extending his expertise in Operating Systems, Data Management, and Theoretical Computer Science. He has applied for two patents jointly with Srivatsan Sridharan.

Yamini Muralitharan. Yamini is a young academician who is passionate about research in Software Engineering and Testing. She completed her Bachelors at the College of Engineering, Guindy, Anna University, Tamilnadu, and is currently pursuing her Masters at IIIT - Bangalore. She has also applied for one patent jointly with Srivatsan Sridharan at the Patent Office, Chennai.