SlideShare a Scribd company logo
1 of 5
Download to read offline
International Journal of Science and Research (IJSR)
ISSN (Online): 2319-7064
Index Copernicus Value (2013): 6.14 | Impact Factor (2014): 5.611
Volume 5 Issue 2, February 2016
www.ijsr.net
Licensed Under Creative Commons Attribution CC BY
An Improved Mining of Biomedical Data from Web
Documents Using Clustering
Nikita Gupta1
, Gunjan Pahuja2
1
Department of Computer Science, A.P.J. Abdul Kalam University, Lucknow, U.P., India
Abstract: Now a day’s web is the main source of information in every field. Web is also expanding exponentially day by day. To get the
relevant information is very time consuming and is not a very easy task. Mainly users go for the various search engines to search any
information. But sometimes search engines are not able to give the useful results as most of the web documents are present in
unstructured manner. Data mining is extraction of information from large database. Text mining uses the many techniques of data
mining. In web, biomedical documents are also increasing at a very fast pace but most of them are unstructured text. These documents
can be very helpful in diagnostics, treatment and prevention of any disease. There are we millions of documents on internet about a
specific term so to obtain a relevant document is very difficult. The goal of this is to apply text mining techniques to retrieve useful
biomedical web documents. Here a more efficient mechanism is proposed which uses the optimised K-means clustering algorithm where
it can group the similar documents in one place. This approach will help the user to get all the relevant biomedical documents at one
place. On comparing our approach with the original k-means algorithm and found that our algorithm on an average giving 99.06 % F-
measure.
Keywords: Data Mining, Biomedical Data, Clustering, K-means.
1. Introduction
Biomedical information search over search engines results in
large number of query results. These results are not always
relevant or sometimes fail to provide the information user
looking for. Most of the explicit knowledge is stored in
different types of documents but only a few people (often
only the authors of the documents) know where to locate
them. However, this information is often scattered among
many web servers and hosts, using many different formats
.The manual organization of these documents can be time-
consuming to create and cannot usually be used in practice.
Some websites use so many fake links to increase their
importance. The famous search engine Google uses the
ranking algorithm like PageRank [19] and HITS (Hypertext
Induced Topic search) [20] to give the query results. Some
web pages have too many links of advertisements, which
may not relevant to the user. It is the user, and only the user
who can say whether a given item of information is relevant
to a query issued to a web search engine or to a private
folder in which documents should be filtered according to
the content.
Biomedical terms are different from the normal English
language words [1]. So to get the domain specific
information is very difficult now days. Recently Biomedical
Information is very increasing very rapidly on web. Most of
this information on web is stored in unstructured way. The
need of automatically retrieval of useful knowledge from the
huge amount of textual data in order to assist the human
analysis is fully needed. Data Mining discovers the
previously unknown and potentially useful information from
data in databases. Text mining is the extension of data
mining. Text mining has been one of the fastest growing
research fields for the past few decades. There are various
text mining techniques available named by Information
retrieval, Information extraction, text classification, text
Clustering and so on [3].
For application developers, this interest is mainly due to the
enormously increasing need to handle larger and larger
quantities of biomedical documents, a need emphasized by
increased connectivity and availability of document bases of
all types at all levels in the information chain. Biomedical
text mining generally has five phases: Text Gathering, Text
Preprocessing, Data Analysis, Visualization, and Evaluation
[18]. Text gathering aims to get desired text on a certain
topic; In text preprocessing tokenization, part of speech
tagging stopwords removal etc is done ; In data analysis
actual information extraction is done ; Visualization is how
to visual the results obtained; At last evaluation is done.
Clustering is an unsupervised technique used to discover
new set of categories. Here the proposed system gives an
optimized k-means algorithm to make cluster of the
biomedical documents so that the documents would be more
relevant and informative.
In this paper to identify the biomedical documents a UMLS
(Unified Medical Language System) is used. This will
conclude that if the document contains the biomedical term
then the document is of biomedical domain. The proposed
system finds out the biomedical terms in web documents and
classifies the documents on the basis of it. Then the proposed
system forms the cluster of similar and related documents
using optimised K-means algorithm. Our approach will
discard the non biomedical documents during classification
of web documents.
2. Literature Survey
The Literature [16] describes that World Wide Web is the
most worthful resources for information retrieval due to
increase in amount of information available online. Web
Mining technologies are the correct solution for discovering
the relevant data. The paper provides introduction of Web
mining as well as discuss the web mining categories.
Paper ID: NOV161318 1396
International Journal of Science and Research (IJSR)
ISSN (Online): 2319-7064
Index Copernicus Value (2013): 6.14 | Impact Factor (2014): 5.611
Volume 5 Issue 2, February 2016
www.ijsr.net
Licensed Under Creative Commons Attribution CC BY
Mostly search engines are used for obtaining the relevant
information. Search Engines are the programs that search
documents related to the queries keyword and returns the list
of documents where the keyword found. Pooja et al. [4]
describes that Google uses the two link-based ranking
algorithms, PageRank [19] and HITS [20] with its
advantages and disadvantages.
The literature [2] describes that the huge size of biomedical
literature and its speedy growth in last few decades makes
searching the relevant information a needy task. Obtaining
and reading relevant information in literature is crucial for
any researcher in life sciences. Fei et al. [5] discusses the
text mining application in cancer research as cancer is a
malignant disease and biomedical text has large value for its
diagnostic, treatment and prevention. The paper also
provides resources for cancer text mining and gives the
general workflow of text mining in cancer system biology.
Ravi et al. [2] gives an efficient approach using NLP to get
correct biomedical document from web by counting the
number of biomedical terms in the documents with the help
of UMLS ontology and finally ranking the documents
according to them.
Wang et al. [11] proposed a similarity method for medical
literature searches the normalised MEDLINE distance based
on the biomedical literature knowledge source MEDLINE.
This model is to refine the search process using user
interests. User interests are analyzed to calculate semantic
similarity among interest terms.
A new text mining algorithm to increase the performance of
information retrieval system for Medline and Pubmed has
been proposed by Sagar et al. [6]. The unique term are
weighted by using novel global relevant weighing schema.
The biomedical text retrieval is the techniques applied to
biomedical resources to extract information required as
published biomedical research is very large. Sumit et al. [9]
proposed a technique of soft clustering data mining
algorithm to increase the accuracy of biomedical text
extraction by discovering the predictive relationships
between the different pieces of extracted data.
Ontology is like dictionary or glossary but with more details
like properties and interrelationships of the entities that
exists for a particular domain. Tan et al. [12] showed that
text mining of biomedical documents require substantial
amount of domain knowledge. The paper proposes a model
for choosing the most appropriate ontology for particular
application.
3. Background
3.1 K-means Clustering Algorithm
K-means is a widely used popular partitional and
unsupervised technique used for document clustering due to
its simplicity [21]. K-means algorithm uses an iterative
process in order to cluster database. If the algorithm is
required to produce K clusters then there will be K initial
means and K final means. If K is the desired number of
clusters, then partitional approaches typically find all K
clusters at once. K-means is based on the idea that a center
point can represent a cluster.
K-means algorithm involves the following steps:
Input: Collection of HTML documents.
Output: Clusters of documents.
Algorithm
1)Load all documents.
2)Remove stopwords and html tags.
3)Set the value of k.
4)Select k points at random as cluster centers.
5)Calculate the cosine similarity based on TF/IDF for each
documents.
6)Assign documents to their closest cluster center according
to the cosine similarity.
7)Recalculate the new cluster center.
8)If no document was reassigned then stop, otherwise repeat
from step 6.
3.2 UMLS
UMLS stands for the Unified Medical Language System is a
collection of many controlled vocabularies in biomedical
sciences [23]. UMLS is very large and contains terms related
to biomedical and health. It is a wide range of general and
specialized biomedical terminologies. It is used in variety of
purposes like Information retrieval, Public Health Reporting,
Clinical Health and Research. UMLS further provides
facilities for natural language processing. It is intended to be
used mainly by developers of system in medical informatics.
The number of biomedical resources available to researchers
is very large. Often the problem arises when the medical
literature is searched as large volume of documents
retrieved. The purpose of the UMLS is to enhance access to
this literature by facilitating the development of computer
systems that understand biomedical language.
3.3 Vector Space Model
In this work documents are represented using the vector-
space model (VSM) for clustering algorithm. VSM is most
widely used algebraic model for representing text documents
as vector of identifiers. In this model, each document, d, is
considered to be a vector, d, in the term-space. Each
document is represented by the (TF) vector, (tf1,
tf2,…, tfn), where tfi is the frequency of the ith term in the
document. In addition, we use the version of this model that
weights each term based on its inverse document frequency
(IDF) in the document collection. The weight associated
with each keyword determines the relevance of the keyword
in the document. The similarity between two documents can
be measured in various ways. There are a number of possible
measures for computing the similarity between the
documents. The most common one is the cosine measure,
which is defined as:
Cosine-similarity (d1, d2) = (d1 • d2) / ||d1|| ||d2||
Paper ID: NOV161318 1397
International Journal of Science and Research (IJSR)
ISSN (Online): 2319-7064
Index Copernicus Value (2013): 6.14 | Impact Factor (2014): 5.611
Volume 5 Issue 2, February 2016
www.ijsr.net
Licensed Under Creative Commons Attribution CC BY
Where • indicates the vector dot product and ||d|| is the
length of vector d. The strength of the similarity depends on
the value of Ɵ. For K-means clustering, the cosine measure
is used to compute which document centroid is closest to a
given document.
4. Proposed Model
Following is our proposed system of mining the biomedical
documents. Briefly describing the various phases as follows:
4.1 Document Collection
Firstly collect several number of HTML documents using
search engines. Then the HTML documents are converted to
simple text documents .Text Documents are used to perform
the further process.
4.2 Documents Preprocessing
The collected documents are preprocessed to perform the
effective documents search and to improve the efficiency of
the system. There are abundant unnecessary texts on web
documents like articles, be verbs, preposition, conjunction,
and punctuations etc, which are categorized as stop words.
There are large quantities of stop words appear in the
document and typically have no use in this research. The
stopword are removed and the documents are simplified.
Figure1: Proposed System Arcitecture
4.3 Classification of Documents
Here in this phase the documents are classified into two
categories i.e. biomedical documents and Non-biomedical
Documents. This classification is done to extract the
biomedical documents from all the documents. Non-
biomedical documents are discarded. The classification of
documents is done with the help of the UMLS.
4.4 Clustering of the Documents
The next step is to make cluster of similar and related
documents. The improved K-means algorithm is used to
make the clusters of biomedical documents.
5. Experimental Results
To obtain the similar and related biomedical documents we
have collected approximately 100 web documents using
Google search engine. After preprocessing, documents are
classified into biomedical documents and non-biomedical
documents as shown in fig.2.
Figure 2: Classification of Documents
Non biomedical documents are discarded while Clusters of
biomedical documents using K-means clustering. We tested
the proposed system for both original K-means algorithm
and improved K-means algorithm. We found that our
improved approach takes less execution time. The
comparisons of results are shown in fig.3.
Figure 3: Showing Comparison of Execution time
Three performance measures precision, recall and F-measure
have been calculated for both the approaches. The tested
results of performance measures for demonstration are
shown in table 1. The use of this is to obtain the similar
documents in a group. So that finding the relevant
information is not time consuming.
Paper ID: NOV161318 1398
International Journal of Science and Research (IJSR)
ISSN (Online): 2319-7064
Index Copernicus Value (2013): 6.14 | Impact Factor (2014): 5.611
Volume 5 Issue 2, February 2016
www.ijsr.net
Licensed Under Creative Commons Attribution CC BY
Table 1: Comparison of Performance Measures
6. Conclusion
The main aspect is retrieving the related biomedical
documents in same group so that there is no need to go
through all the documents. This paper presents improve the
K-means algorithm so that it can perform clustering of
biomedical documents in reasonable durations. It has been
observed that selection of all words also affects the
convergence time of K-means algorithm. The experimental
results show that the execution time of the improved
algorithm has become less or approximately half of the run
time as in original. This shows a technique for selecting
better initial points than random ones. We found that our
approach is giving better performance results. This work can
be further extended by ranking the documents in each cluster
parallelization of K-means algorithm has been proposed as
an improvement for the algorithm. In future we will consider
the hierarchy of the biomedical documents in each cluster
which will help the user to find all his documents in an
organised and manner.
References
[1] Ravi Shankar Shukla, Kamendra Singh Yadav, Syed
Tarif Rizvi, and Faisal Haseen, ”An Efficient Mining of
Biomedical Data from Hypertext Documents via NLP,”
Proc. of the 3rd
Intell. Comput. (FICTA), Vol.1, pp. 651-
658, 2014.
[2] Chung-Chi Huang and Zhiyong Lu, “Community
challenges in biomedical text mining over 10 years:
success, failure and the future,” Briefings in
Bioinformatics Advance Access, May 1, 2015.
[3] Dr. Shilpa Dang and Peerzada Hamid Ahmad,” A
Review of Text Mining Techniques Associated with
Various Application Areas,” International Journal of
Science and Research (IJSR), Volume 4, no. 2, pp.
2461-2466, February, 2015.
[4] Pooja Devi1, Ashlesha Gupta, and Ashutosh Dixit,”
Comparative Study of HITS and PageRank Link based
Ranking Algorithms,” International Journal of
Advanced Research in Computer and Communication
Engineering, Vol. 3, no. 2, February, 2014.
[5] Fei Zhu , Preecha Patumcharoenpol , Cheng Zhang ,
Yang Yang , Jonathan Chan , Asawin Meechai ,
Wanwipa Vongsangnak , and Bairong Shen, “
Biomedical text mining and its applications in cancer
research,” Journal of Biomedical Informatics, Vol. 46,
no. 2, pp. 200–211, 2013.
[6] S. Sagar Imambi and T. Sudha, “Extraction of
Biomedical Information from MEDLINE Documents –
A Text Mining Approach,” International Journal of
Science, Environment and Technology, Vol. 2, no. 2,
pp. 267 – 274, April, 2013.
[7] Shally and Rejimol Robinson, “Survey for Mining
Biomedical data from HTTP Documents,” International
journal of Engineering Sciences & Research
Technology, pp.165-169, Feburary, 2013.
[8] Hui Yang and Yan Dong,” Recognizing Hierarchically
Related Biomedical Entities Using MeSH-Based
Mapping,” Tsinghua Science and Technology, Vol. 17,
no. 6, pp. 609-618 December, 2012.
[9] Sumit Vashishta and Dr. Yogendra Kumar Jain,”
Efficient Retrieval of Text for Biomedical Domain using
Data Mining Algorithm,” International Journal of
Advanced Computer Science and Applications
(IJACSA), Vol. 2, No. 4, pp. 77-80, 2011.
[10]Sazia Salah uddin and Rashedur M Rahman, “Mining
Biomedical Data from Hypertext Documents, “ Proc. of
14th Int. Conf. on Computer and Information
Technology , ICCIT 2011, pp. 417-422, Dhaka,
Bangladesh , December 22-24, 2011.
[11]WANG Yan, WANG Cong, ZENG Yi, HUANG
Zhisheng, Vassil Momtchev, Bo Andersson, REN Xu
,and ZHONG Ning,” Normalized MEDLINE Distance
in Context-Aware Life Science Literature Searches”,
Tsinghua Science and Technology, Vol. 15, no. 6,
lpp.709-715, December, 2010.
[12]He Tan and Patrick Lambrix, “Selecting Ontology for
Biomedical Text Mining,” Workshop on BioNLP, pp.
55–62, Boulder, Colorado, June 2009, Association for
Computational Linguistics, 2009.
[13]Saurav Sahay, Sougata Mukhaerjea, Eugene Agichtein,
Ernest V. Garcia , Shamkant B. Navathe, and Ashwin
Ram,” Discovering Semantic Biomedical Relations
Utilizing the Web,” ACM-Transaction on knowledge
Discovery from Data (TKDD), Vol. 2, no. 1, March,
2008.
[14]SougataMukherjea and Saurav Sahay, “Discovering
Biomedical Relations Utilizing the World-Wide Web,”
Pacific Symposium on Biocomputing 11:164-
175(2006).
[15]Rainer Malik and Arno Siebes, “CONAN: An
Integrative System for Biomedical Literature Mining,”
Progress in Artificial Intelligence, 12th Portuguese
Conference on Artificial Intelligence, EPIA 2005, pp.
248 – 259, Dec. 5-8, 2005.
[16]Miguel Gomes da Costa Junior and Zhiguo Gong,” Web
Structure Mining: An Introduction,” Proc. of the 2005
IEEE Int. Conf. on Information Acquisition, pp. 590-
595, Hong Kong and Macau, China, june 27-july 3,
2005.
[17]Wempu Xing and Ali Ghorbani, “Weighted PageRank
Algorithm,” Proc. of the second Annual Conference on
Communication Networks and Services Research
(CNSR’04), May 19-21, 2004.
[18]Brigitte Mathiak and Silke Eckstein, ”Five Steps to Text
Mining in Biomedical Literature,” Proc. of the Second
European Workshop on Data Mining and Text Mining
in Bioimformatics, pp. 47-50, Pisa, Italy, September 24,
2004.
Paper ID: NOV161318 1399
International Journal of Science and Research (IJSR)
ISSN (Online): 2319-7064
Index Copernicus Value (2013): 6.14 | Impact Factor (2014): 5.611
Volume 5 Issue 2, February 2016
www.ijsr.net
Licensed Under Creative Commons Attribution CC BY
[19]Lawrence and Brin, Sergey and Motwani, Rajeev and
Winograd, Terry, "The Page Rank Citation Ranking:
Bringing Order to the Web,” Technical Report. Stanford
InfoLab, 1999.
[20]Jon M. Kleinberg, "Authoritative sources in a
hyperlinked environment,” J. ACM, Vol. 46, no. 5, pp.
604 -632, 1999.
[21]K. A. Abdul Nazeer and M. P. Sebastian, “Improving
the Accuracy and Efficiency of the k-means Clustering
Algorithm,” Proceedings of the World Congress on
Engineering, WCE 2009, Vol.1, pp.308-312, London,
U.K., July 1 - 3, 2009.
[22]H. Shatkay and R. Feldman, "Mining the Biomedical
Literature in the Genomic Era: An Overview,” Journal
of Computational Biology (JCB), Vo.10, Issue: 6, pp.
821-855, July 5, 2004.
[23]Keith E. Cambell, Diane E. Oliver, and Edward H.
Shortlife, “The Unified Medical Language System:
Toward a collaborative Approach for Solving
Terminologic problems,” Journal of the American
Medical Informatics Association, Vol.5, no.1, Jan / Feb,
1998.
[24]Https://www.nlm.nih.gov/research/umls/knowledge_sou
rces/.
Author Profile
Nikita Gupta is a research scholar pursuing M.Tech in
Computer Science & Engineering from JSS Academy
of Technical Education, Noida, U.P., India. She has
completed her B.tech in CSE from Mangalayatan
University, Aligarh, U.P., India in 2012.
Paper ID: NOV161318 1400

More Related Content

Similar to An Improved Mining Of Biomedical Data From Web Documents Using Clustering

A web content mining application for detecting relevant pages using Jaccard ...
A web content mining application for detecting relevant pages  using Jaccard ...A web content mining application for detecting relevant pages  using Jaccard ...
A web content mining application for detecting relevant pages using Jaccard ...IJECEIAES
 
Perception Determined Constructing Algorithm for Document Clustering
Perception Determined Constructing Algorithm for Document ClusteringPerception Determined Constructing Algorithm for Document Clustering
Perception Determined Constructing Algorithm for Document ClusteringIRJET Journal
 
A model of hybrid genetic algorithm particle swarm optimization(hgapso) based...
A model of hybrid genetic algorithm particle swarm optimization(hgapso) based...A model of hybrid genetic algorithm particle swarm optimization(hgapso) based...
A model of hybrid genetic algorithm particle swarm optimization(hgapso) based...eSAT Journals
 
A model of hybrid genetic algorithm particle swarm optimization(hgapso) based...
A model of hybrid genetic algorithm particle swarm optimization(hgapso) based...A model of hybrid genetic algorithm particle swarm optimization(hgapso) based...
A model of hybrid genetic algorithm particle swarm optimization(hgapso) based...eSAT Publishing House
 
A model of hybrid genetic algorithm particle swarm optimization(hgapso) based...
A model of hybrid genetic algorithm particle swarm optimization(hgapso) based...A model of hybrid genetic algorithm particle swarm optimization(hgapso) based...
A model of hybrid genetic algorithm particle swarm optimization(hgapso) based...eSAT Journals
 
Semantics-based clustering approach for similar research area detection
Semantics-based clustering approach for similar research area detectionSemantics-based clustering approach for similar research area detection
Semantics-based clustering approach for similar research area detectionTELKOMNIKA JOURNAL
 
Great model a model for the automatic generation of semantic relations betwee...
Great model a model for the automatic generation of semantic relations betwee...Great model a model for the automatic generation of semantic relations betwee...
Great model a model for the automatic generation of semantic relations betwee...ijcsity
 
A SEMANTIC RETRIEVAL SYSTEM FOR EXTRACTING RELATIONSHIPS FROM BIOLOGICAL CORPUS
A SEMANTIC RETRIEVAL SYSTEM FOR EXTRACTING RELATIONSHIPS FROM BIOLOGICAL CORPUS A SEMANTIC RETRIEVAL SYSTEM FOR EXTRACTING RELATIONSHIPS FROM BIOLOGICAL CORPUS
A SEMANTIC RETRIEVAL SYSTEM FOR EXTRACTING RELATIONSHIPS FROM BIOLOGICAL CORPUS AIRCC Publishing Corporation
 
A Semantic Retrieval System for Extracting Relationships from Biological Corpus
A Semantic Retrieval System for Extracting Relationships from Biological CorpusA Semantic Retrieval System for Extracting Relationships from Biological Corpus
A Semantic Retrieval System for Extracting Relationships from Biological CorpusAIRCC Publishing Corporation
 
A Semantic Retrieval System for Extracting Relationships from Biological Corpus
A Semantic Retrieval System for Extracting Relationships from Biological CorpusA Semantic Retrieval System for Extracting Relationships from Biological Corpus
A Semantic Retrieval System for Extracting Relationships from Biological Corpusijcsit
 
Peter Embi's 2011 AMIA CRI Year-in-Review
Peter Embi's 2011 AMIA CRI Year-in-ReviewPeter Embi's 2011 AMIA CRI Year-in-Review
Peter Embi's 2011 AMIA CRI Year-in-ReviewPeter Embi
 
PERFORMANCE EVALUATION OF STRUCTURED AND SEMI-STRUCTURED BIOINFORMATICS TOOLS...
PERFORMANCE EVALUATION OF STRUCTURED AND SEMI-STRUCTURED BIOINFORMATICS TOOLS...PERFORMANCE EVALUATION OF STRUCTURED AND SEMI-STRUCTURED BIOINFORMATICS TOOLS...
PERFORMANCE EVALUATION OF STRUCTURED AND SEMI-STRUCTURED BIOINFORMATICS TOOLS...ijseajournal
 
PERFORMANCE EVALUATION OF STRUCTURED AND SEMI-STRUCTURED BIOINFORMATICS TOOLS...
PERFORMANCE EVALUATION OF STRUCTURED AND SEMI-STRUCTURED BIOINFORMATICS TOOLS...PERFORMANCE EVALUATION OF STRUCTURED AND SEMI-STRUCTURED BIOINFORMATICS TOOLS...
PERFORMANCE EVALUATION OF STRUCTURED AND SEMI-STRUCTURED BIOINFORMATICS TOOLS...ijseajournal
 
Classification of News and Research Articles Using Text Pattern Mining
Classification of News and Research Articles Using Text Pattern MiningClassification of News and Research Articles Using Text Pattern Mining
Classification of News and Research Articles Using Text Pattern MiningIOSR Journals
 
ONTOLOGY-DRIVEN INFORMATION RETRIEVAL FOR HEALTHCARE INFORMATION SYSTEM : A C...
ONTOLOGY-DRIVEN INFORMATION RETRIEVAL FOR HEALTHCARE INFORMATION SYSTEM : A C...ONTOLOGY-DRIVEN INFORMATION RETRIEVAL FOR HEALTHCARE INFORMATION SYSTEM : A C...
ONTOLOGY-DRIVEN INFORMATION RETRIEVAL FOR HEALTHCARE INFORMATION SYSTEM : A C...IJNSA Journal
 
Classification-based Retrieval Methods to Enhance Information Discovery on th...
Classification-based Retrieval Methods to Enhance Information Discovery on th...Classification-based Retrieval Methods to Enhance Information Discovery on th...
Classification-based Retrieval Methods to Enhance Information Discovery on th...IJMIT JOURNAL
 
A scalable hybrid research paper recommender system for micro
A scalable hybrid research paper recommender system for microA scalable hybrid research paper recommender system for micro
A scalable hybrid research paper recommender system for microaman341480
 
A Systems Approach To Qualitative Data Management And Analysis
A Systems Approach To Qualitative Data Management And AnalysisA Systems Approach To Qualitative Data Management And Analysis
A Systems Approach To Qualitative Data Management And AnalysisMichele Thomas
 
Information Retrieval based on Cluster Analysis Approach
Information Retrieval based on Cluster Analysis ApproachInformation Retrieval based on Cluster Analysis Approach
Information Retrieval based on Cluster Analysis ApproachAIRCC Publishing Corporation
 

Similar to An Improved Mining Of Biomedical Data From Web Documents Using Clustering (20)

A web content mining application for detecting relevant pages using Jaccard ...
A web content mining application for detecting relevant pages  using Jaccard ...A web content mining application for detecting relevant pages  using Jaccard ...
A web content mining application for detecting relevant pages using Jaccard ...
 
Perception Determined Constructing Algorithm for Document Clustering
Perception Determined Constructing Algorithm for Document ClusteringPerception Determined Constructing Algorithm for Document Clustering
Perception Determined Constructing Algorithm for Document Clustering
 
A model of hybrid genetic algorithm particle swarm optimization(hgapso) based...
A model of hybrid genetic algorithm particle swarm optimization(hgapso) based...A model of hybrid genetic algorithm particle swarm optimization(hgapso) based...
A model of hybrid genetic algorithm particle swarm optimization(hgapso) based...
 
A model of hybrid genetic algorithm particle swarm optimization(hgapso) based...
A model of hybrid genetic algorithm particle swarm optimization(hgapso) based...A model of hybrid genetic algorithm particle swarm optimization(hgapso) based...
A model of hybrid genetic algorithm particle swarm optimization(hgapso) based...
 
A model of hybrid genetic algorithm particle swarm optimization(hgapso) based...
A model of hybrid genetic algorithm particle swarm optimization(hgapso) based...A model of hybrid genetic algorithm particle swarm optimization(hgapso) based...
A model of hybrid genetic algorithm particle swarm optimization(hgapso) based...
 
Semantics-based clustering approach for similar research area detection
Semantics-based clustering approach for similar research area detectionSemantics-based clustering approach for similar research area detection
Semantics-based clustering approach for similar research area detection
 
Great model a model for the automatic generation of semantic relations betwee...
Great model a model for the automatic generation of semantic relations betwee...Great model a model for the automatic generation of semantic relations betwee...
Great model a model for the automatic generation of semantic relations betwee...
 
A SEMANTIC RETRIEVAL SYSTEM FOR EXTRACTING RELATIONSHIPS FROM BIOLOGICAL CORPUS
A SEMANTIC RETRIEVAL SYSTEM FOR EXTRACTING RELATIONSHIPS FROM BIOLOGICAL CORPUS A SEMANTIC RETRIEVAL SYSTEM FOR EXTRACTING RELATIONSHIPS FROM BIOLOGICAL CORPUS
A SEMANTIC RETRIEVAL SYSTEM FOR EXTRACTING RELATIONSHIPS FROM BIOLOGICAL CORPUS
 
A Semantic Retrieval System for Extracting Relationships from Biological Corpus
A Semantic Retrieval System for Extracting Relationships from Biological CorpusA Semantic Retrieval System for Extracting Relationships from Biological Corpus
A Semantic Retrieval System for Extracting Relationships from Biological Corpus
 
A Semantic Retrieval System for Extracting Relationships from Biological Corpus
A Semantic Retrieval System for Extracting Relationships from Biological CorpusA Semantic Retrieval System for Extracting Relationships from Biological Corpus
A Semantic Retrieval System for Extracting Relationships from Biological Corpus
 
AN IMPROVED TECHNIQUE FOR DOCUMENT CLUSTERING
AN IMPROVED TECHNIQUE FOR DOCUMENT CLUSTERINGAN IMPROVED TECHNIQUE FOR DOCUMENT CLUSTERING
AN IMPROVED TECHNIQUE FOR DOCUMENT CLUSTERING
 
Peter Embi's 2011 AMIA CRI Year-in-Review
Peter Embi's 2011 AMIA CRI Year-in-ReviewPeter Embi's 2011 AMIA CRI Year-in-Review
Peter Embi's 2011 AMIA CRI Year-in-Review
 
PERFORMANCE EVALUATION OF STRUCTURED AND SEMI-STRUCTURED BIOINFORMATICS TOOLS...
PERFORMANCE EVALUATION OF STRUCTURED AND SEMI-STRUCTURED BIOINFORMATICS TOOLS...PERFORMANCE EVALUATION OF STRUCTURED AND SEMI-STRUCTURED BIOINFORMATICS TOOLS...
PERFORMANCE EVALUATION OF STRUCTURED AND SEMI-STRUCTURED BIOINFORMATICS TOOLS...
 
PERFORMANCE EVALUATION OF STRUCTURED AND SEMI-STRUCTURED BIOINFORMATICS TOOLS...
PERFORMANCE EVALUATION OF STRUCTURED AND SEMI-STRUCTURED BIOINFORMATICS TOOLS...PERFORMANCE EVALUATION OF STRUCTURED AND SEMI-STRUCTURED BIOINFORMATICS TOOLS...
PERFORMANCE EVALUATION OF STRUCTURED AND SEMI-STRUCTURED BIOINFORMATICS TOOLS...
 
Classification of News and Research Articles Using Text Pattern Mining
Classification of News and Research Articles Using Text Pattern MiningClassification of News and Research Articles Using Text Pattern Mining
Classification of News and Research Articles Using Text Pattern Mining
 
ONTOLOGY-DRIVEN INFORMATION RETRIEVAL FOR HEALTHCARE INFORMATION SYSTEM : A C...
ONTOLOGY-DRIVEN INFORMATION RETRIEVAL FOR HEALTHCARE INFORMATION SYSTEM : A C...ONTOLOGY-DRIVEN INFORMATION RETRIEVAL FOR HEALTHCARE INFORMATION SYSTEM : A C...
ONTOLOGY-DRIVEN INFORMATION RETRIEVAL FOR HEALTHCARE INFORMATION SYSTEM : A C...
 
Classification-based Retrieval Methods to Enhance Information Discovery on th...
Classification-based Retrieval Methods to Enhance Information Discovery on th...Classification-based Retrieval Methods to Enhance Information Discovery on th...
Classification-based Retrieval Methods to Enhance Information Discovery on th...
 
A scalable hybrid research paper recommender system for micro
A scalable hybrid research paper recommender system for microA scalable hybrid research paper recommender system for micro
A scalable hybrid research paper recommender system for micro
 
A Systems Approach To Qualitative Data Management And Analysis
A Systems Approach To Qualitative Data Management And AnalysisA Systems Approach To Qualitative Data Management And Analysis
A Systems Approach To Qualitative Data Management And Analysis
 
Information Retrieval based on Cluster Analysis Approach
Information Retrieval based on Cluster Analysis ApproachInformation Retrieval based on Cluster Analysis Approach
Information Retrieval based on Cluster Analysis Approach
 

More from Kelly Lipiec

Buy The Essay Buy Essay Online An
Buy The Essay Buy Essay Online AnBuy The Essay Buy Essay Online An
Buy The Essay Buy Essay Online AnKelly Lipiec
 
Writing A Research Paper For Scholarly Journals
Writing A Research Paper For Scholarly JournalsWriting A Research Paper For Scholarly Journals
Writing A Research Paper For Scholarly JournalsKelly Lipiec
 
Buy College Admission Essa
Buy College Admission EssaBuy College Admission Essa
Buy College Admission EssaKelly Lipiec
 
009 Essay Example Yale Supplement Pa Why Nyu New Yo
009 Essay Example Yale Supplement Pa Why Nyu New Yo009 Essay Example Yale Supplement Pa Why Nyu New Yo
009 Essay Example Yale Supplement Pa Why Nyu New YoKelly Lipiec
 
Thesis Statements And Controlling Ideas - Writing
Thesis Statements And Controlling Ideas - WritingThesis Statements And Controlling Ideas - Writing
Thesis Statements And Controlling Ideas - WritingKelly Lipiec
 
How To Teach A Child To Write
How To Teach A Child To WriteHow To Teach A Child To Write
How To Teach A Child To WriteKelly Lipiec
 
Writing An Expository Essay
Writing An Expository EssayWriting An Expository Essay
Writing An Expository EssayKelly Lipiec
 
Introduction Of Education Essay. Inclusive Education
Introduction Of Education Essay. Inclusive EducationIntroduction Of Education Essay. Inclusive Education
Introduction Of Education Essay. Inclusive EducationKelly Lipiec
 
Clouds Alphabet Stock Vector. Illustration Of Font, Letter - 1
Clouds Alphabet Stock Vector. Illustration Of Font, Letter - 1Clouds Alphabet Stock Vector. Illustration Of Font, Letter - 1
Clouds Alphabet Stock Vector. Illustration Of Font, Letter - 1Kelly Lipiec
 
Pin On Daily Inspiration William Hannah
Pin On Daily Inspiration William HannahPin On Daily Inspiration William Hannah
Pin On Daily Inspiration William HannahKelly Lipiec
 
Comparison Essay Sample. Compare And Contrast Es
Comparison Essay Sample. Compare And Contrast EsComparison Essay Sample. Compare And Contrast Es
Comparison Essay Sample. Compare And Contrast EsKelly Lipiec
 
Fundations Writing Paper With Picture S
Fundations Writing Paper With Picture SFundations Writing Paper With Picture S
Fundations Writing Paper With Picture SKelly Lipiec
 
003 Essay Example Why I Deserve This Scholarship
003 Essay Example Why I Deserve This Scholarship003 Essay Example Why I Deserve This Scholarship
003 Essay Example Why I Deserve This ScholarshipKelly Lipiec
 
Examples Of How To Conclude An Essay Evadon.Co.Uk
Examples Of How To Conclude An Essay Evadon.Co.UkExamples Of How To Conclude An Essay Evadon.Co.Uk
Examples Of How To Conclude An Essay Evadon.Co.UkKelly Lipiec
 
A Strong Outline For An Argumentative Essay Should Inclu
A Strong Outline For An Argumentative Essay Should IncluA Strong Outline For An Argumentative Essay Should Inclu
A Strong Outline For An Argumentative Essay Should IncluKelly Lipiec
 
Printable Writing Pages You Can Choose Traditional O
Printable Writing Pages You Can Choose Traditional OPrintable Writing Pages You Can Choose Traditional O
Printable Writing Pages You Can Choose Traditional OKelly Lipiec
 
The 8 Essentials Of Writing An Essay In Under 24 Hours
The 8 Essentials Of Writing An Essay In Under 24 HoursThe 8 Essentials Of Writing An Essay In Under 24 Hours
The 8 Essentials Of Writing An Essay In Under 24 HoursKelly Lipiec
 
Example Of A Qualitative Research Article Critique
Example Of A Qualitative Research Article CritiqueExample Of A Qualitative Research Article Critique
Example Of A Qualitative Research Article CritiqueKelly Lipiec
 
How To Write A Process Paper For Science Fair
How To Write A Process Paper For Science FairHow To Write A Process Paper For Science Fair
How To Write A Process Paper For Science FairKelly Lipiec
 
Rite In The Rain All Weather Writing Paper - Expedition Journal ...
Rite In The Rain All Weather Writing Paper - Expedition Journal ...Rite In The Rain All Weather Writing Paper - Expedition Journal ...
Rite In The Rain All Weather Writing Paper - Expedition Journal ...Kelly Lipiec
 

More from Kelly Lipiec (20)

Buy The Essay Buy Essay Online An
Buy The Essay Buy Essay Online AnBuy The Essay Buy Essay Online An
Buy The Essay Buy Essay Online An
 
Writing A Research Paper For Scholarly Journals
Writing A Research Paper For Scholarly JournalsWriting A Research Paper For Scholarly Journals
Writing A Research Paper For Scholarly Journals
 
Buy College Admission Essa
Buy College Admission EssaBuy College Admission Essa
Buy College Admission Essa
 
009 Essay Example Yale Supplement Pa Why Nyu New Yo
009 Essay Example Yale Supplement Pa Why Nyu New Yo009 Essay Example Yale Supplement Pa Why Nyu New Yo
009 Essay Example Yale Supplement Pa Why Nyu New Yo
 
Thesis Statements And Controlling Ideas - Writing
Thesis Statements And Controlling Ideas - WritingThesis Statements And Controlling Ideas - Writing
Thesis Statements And Controlling Ideas - Writing
 
How To Teach A Child To Write
How To Teach A Child To WriteHow To Teach A Child To Write
How To Teach A Child To Write
 
Writing An Expository Essay
Writing An Expository EssayWriting An Expository Essay
Writing An Expository Essay
 
Introduction Of Education Essay. Inclusive Education
Introduction Of Education Essay. Inclusive EducationIntroduction Of Education Essay. Inclusive Education
Introduction Of Education Essay. Inclusive Education
 
Clouds Alphabet Stock Vector. Illustration Of Font, Letter - 1
Clouds Alphabet Stock Vector. Illustration Of Font, Letter - 1Clouds Alphabet Stock Vector. Illustration Of Font, Letter - 1
Clouds Alphabet Stock Vector. Illustration Of Font, Letter - 1
 
Pin On Daily Inspiration William Hannah
Pin On Daily Inspiration William HannahPin On Daily Inspiration William Hannah
Pin On Daily Inspiration William Hannah
 
Comparison Essay Sample. Compare And Contrast Es
Comparison Essay Sample. Compare And Contrast EsComparison Essay Sample. Compare And Contrast Es
Comparison Essay Sample. Compare And Contrast Es
 
Fundations Writing Paper With Picture S
Fundations Writing Paper With Picture SFundations Writing Paper With Picture S
Fundations Writing Paper With Picture S
 
003 Essay Example Why I Deserve This Scholarship
003 Essay Example Why I Deserve This Scholarship003 Essay Example Why I Deserve This Scholarship
003 Essay Example Why I Deserve This Scholarship
 
Examples Of How To Conclude An Essay Evadon.Co.Uk
Examples Of How To Conclude An Essay Evadon.Co.UkExamples Of How To Conclude An Essay Evadon.Co.Uk
Examples Of How To Conclude An Essay Evadon.Co.Uk
 
A Strong Outline For An Argumentative Essay Should Inclu
A Strong Outline For An Argumentative Essay Should IncluA Strong Outline For An Argumentative Essay Should Inclu
A Strong Outline For An Argumentative Essay Should Inclu
 
Printable Writing Pages You Can Choose Traditional O
Printable Writing Pages You Can Choose Traditional OPrintable Writing Pages You Can Choose Traditional O
Printable Writing Pages You Can Choose Traditional O
 
The 8 Essentials Of Writing An Essay In Under 24 Hours
The 8 Essentials Of Writing An Essay In Under 24 HoursThe 8 Essentials Of Writing An Essay In Under 24 Hours
The 8 Essentials Of Writing An Essay In Under 24 Hours
 
Example Of A Qualitative Research Article Critique
Example Of A Qualitative Research Article CritiqueExample Of A Qualitative Research Article Critique
Example Of A Qualitative Research Article Critique
 
How To Write A Process Paper For Science Fair
How To Write A Process Paper For Science FairHow To Write A Process Paper For Science Fair
How To Write A Process Paper For Science Fair
 
Rite In The Rain All Weather Writing Paper - Expedition Journal ...
Rite In The Rain All Weather Writing Paper - Expedition Journal ...Rite In The Rain All Weather Writing Paper - Expedition Journal ...
Rite In The Rain All Weather Writing Paper - Expedition Journal ...
 

Recently uploaded

Science 7 - LAND and SEA BREEZE and its Characteristics
Science 7 - LAND and SEA BREEZE and its CharacteristicsScience 7 - LAND and SEA BREEZE and its Characteristics
Science 7 - LAND and SEA BREEZE and its CharacteristicsKarinaGenton
 
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptx
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptxContemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptx
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptxRoyAbrique
 
Separation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and ActinidesSeparation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and ActinidesFatimaKhan178732
 
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions  for the students and aspirants of Chemistry12th.pptxOrganic Name Reactions  for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions for the students and aspirants of Chemistry12th.pptxVS Mahajan Coaching Centre
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introductionMaksud Ahmed
 
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...Marc Dusseiller Dusjagr
 
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Krashi Coaching
 
Concept of Vouching. B.Com(Hons) /B.Compdf
Concept of Vouching. B.Com(Hons) /B.CompdfConcept of Vouching. B.Com(Hons) /B.Compdf
Concept of Vouching. B.Com(Hons) /B.CompdfUmakantAnnand
 
MENTAL STATUS EXAMINATION format.docx
MENTAL     STATUS EXAMINATION format.docxMENTAL     STATUS EXAMINATION format.docx
MENTAL STATUS EXAMINATION format.docxPoojaSen20
 
Employee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxEmployee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxNirmalaLoungPoorunde1
 
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Sapana Sha
 
_Math 4-Q4 Week 5.pptx Steps in Collecting Data
_Math 4-Q4 Week 5.pptx Steps in Collecting Data_Math 4-Q4 Week 5.pptx Steps in Collecting Data
_Math 4-Q4 Week 5.pptx Steps in Collecting DataJhengPantaleon
 
Paris 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityParis 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityGeoBlogs
 
Crayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon ACrayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon AUnboundStockton
 
Interactive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationInteractive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationnomboosow
 
mini mental status format.docx
mini    mental       status     format.docxmini    mental       status     format.docx
mini mental status format.docxPoojaSen20
 

Recently uploaded (20)

Science 7 - LAND and SEA BREEZE and its Characteristics
Science 7 - LAND and SEA BREEZE and its CharacteristicsScience 7 - LAND and SEA BREEZE and its Characteristics
Science 7 - LAND and SEA BREEZE and its Characteristics
 
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptx
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptxContemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptx
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptx
 
Separation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and ActinidesSeparation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and Actinides
 
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions  for the students and aspirants of Chemistry12th.pptxOrganic Name Reactions  for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
 
Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introduction
 
Model Call Girl in Bikash Puri Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Bikash Puri  Delhi reach out to us at 🔝9953056974🔝Model Call Girl in Bikash Puri  Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Bikash Puri Delhi reach out to us at 🔝9953056974🔝
 
TataKelola dan KamSiber Kecerdasan Buatan v022.pdf
TataKelola dan KamSiber Kecerdasan Buatan v022.pdfTataKelola dan KamSiber Kecerdasan Buatan v022.pdf
TataKelola dan KamSiber Kecerdasan Buatan v022.pdf
 
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
 
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
 
Concept of Vouching. B.Com(Hons) /B.Compdf
Concept of Vouching. B.Com(Hons) /B.CompdfConcept of Vouching. B.Com(Hons) /B.Compdf
Concept of Vouching. B.Com(Hons) /B.Compdf
 
MENTAL STATUS EXAMINATION format.docx
MENTAL     STATUS EXAMINATION format.docxMENTAL     STATUS EXAMINATION format.docx
MENTAL STATUS EXAMINATION format.docx
 
Employee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxEmployee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptx
 
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
 
_Math 4-Q4 Week 5.pptx Steps in Collecting Data
_Math 4-Q4 Week 5.pptx Steps in Collecting Data_Math 4-Q4 Week 5.pptx Steps in Collecting Data
_Math 4-Q4 Week 5.pptx Steps in Collecting Data
 
Paris 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityParis 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activity
 
Crayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon ACrayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon A
 
Interactive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationInteractive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communication
 
mini mental status format.docx
mini    mental       status     format.docxmini    mental       status     format.docx
mini mental status format.docx
 
9953330565 Low Rate Call Girls In Rohini Delhi NCR
9953330565 Low Rate Call Girls In Rohini  Delhi NCR9953330565 Low Rate Call Girls In Rohini  Delhi NCR
9953330565 Low Rate Call Girls In Rohini Delhi NCR
 

An Improved Mining Of Biomedical Data From Web Documents Using Clustering

  • 1. International Journal of Science and Research (IJSR) ISSN (Online): 2319-7064 Index Copernicus Value (2013): 6.14 | Impact Factor (2014): 5.611 Volume 5 Issue 2, February 2016 www.ijsr.net Licensed Under Creative Commons Attribution CC BY An Improved Mining of Biomedical Data from Web Documents Using Clustering Nikita Gupta1 , Gunjan Pahuja2 1 Department of Computer Science, A.P.J. Abdul Kalam University, Lucknow, U.P., India Abstract: Now a day’s web is the main source of information in every field. Web is also expanding exponentially day by day. To get the relevant information is very time consuming and is not a very easy task. Mainly users go for the various search engines to search any information. But sometimes search engines are not able to give the useful results as most of the web documents are present in unstructured manner. Data mining is extraction of information from large database. Text mining uses the many techniques of data mining. In web, biomedical documents are also increasing at a very fast pace but most of them are unstructured text. These documents can be very helpful in diagnostics, treatment and prevention of any disease. There are we millions of documents on internet about a specific term so to obtain a relevant document is very difficult. The goal of this is to apply text mining techniques to retrieve useful biomedical web documents. Here a more efficient mechanism is proposed which uses the optimised K-means clustering algorithm where it can group the similar documents in one place. This approach will help the user to get all the relevant biomedical documents at one place. On comparing our approach with the original k-means algorithm and found that our algorithm on an average giving 99.06 % F- measure. Keywords: Data Mining, Biomedical Data, Clustering, K-means. 1. Introduction Biomedical information search over search engines results in large number of query results. These results are not always relevant or sometimes fail to provide the information user looking for. Most of the explicit knowledge is stored in different types of documents but only a few people (often only the authors of the documents) know where to locate them. However, this information is often scattered among many web servers and hosts, using many different formats .The manual organization of these documents can be time- consuming to create and cannot usually be used in practice. Some websites use so many fake links to increase their importance. The famous search engine Google uses the ranking algorithm like PageRank [19] and HITS (Hypertext Induced Topic search) [20] to give the query results. Some web pages have too many links of advertisements, which may not relevant to the user. It is the user, and only the user who can say whether a given item of information is relevant to a query issued to a web search engine or to a private folder in which documents should be filtered according to the content. Biomedical terms are different from the normal English language words [1]. So to get the domain specific information is very difficult now days. Recently Biomedical Information is very increasing very rapidly on web. Most of this information on web is stored in unstructured way. The need of automatically retrieval of useful knowledge from the huge amount of textual data in order to assist the human analysis is fully needed. Data Mining discovers the previously unknown and potentially useful information from data in databases. Text mining is the extension of data mining. Text mining has been one of the fastest growing research fields for the past few decades. There are various text mining techniques available named by Information retrieval, Information extraction, text classification, text Clustering and so on [3]. For application developers, this interest is mainly due to the enormously increasing need to handle larger and larger quantities of biomedical documents, a need emphasized by increased connectivity and availability of document bases of all types at all levels in the information chain. Biomedical text mining generally has five phases: Text Gathering, Text Preprocessing, Data Analysis, Visualization, and Evaluation [18]. Text gathering aims to get desired text on a certain topic; In text preprocessing tokenization, part of speech tagging stopwords removal etc is done ; In data analysis actual information extraction is done ; Visualization is how to visual the results obtained; At last evaluation is done. Clustering is an unsupervised technique used to discover new set of categories. Here the proposed system gives an optimized k-means algorithm to make cluster of the biomedical documents so that the documents would be more relevant and informative. In this paper to identify the biomedical documents a UMLS (Unified Medical Language System) is used. This will conclude that if the document contains the biomedical term then the document is of biomedical domain. The proposed system finds out the biomedical terms in web documents and classifies the documents on the basis of it. Then the proposed system forms the cluster of similar and related documents using optimised K-means algorithm. Our approach will discard the non biomedical documents during classification of web documents. 2. Literature Survey The Literature [16] describes that World Wide Web is the most worthful resources for information retrieval due to increase in amount of information available online. Web Mining technologies are the correct solution for discovering the relevant data. The paper provides introduction of Web mining as well as discuss the web mining categories. Paper ID: NOV161318 1396
  • 2. International Journal of Science and Research (IJSR) ISSN (Online): 2319-7064 Index Copernicus Value (2013): 6.14 | Impact Factor (2014): 5.611 Volume 5 Issue 2, February 2016 www.ijsr.net Licensed Under Creative Commons Attribution CC BY Mostly search engines are used for obtaining the relevant information. Search Engines are the programs that search documents related to the queries keyword and returns the list of documents where the keyword found. Pooja et al. [4] describes that Google uses the two link-based ranking algorithms, PageRank [19] and HITS [20] with its advantages and disadvantages. The literature [2] describes that the huge size of biomedical literature and its speedy growth in last few decades makes searching the relevant information a needy task. Obtaining and reading relevant information in literature is crucial for any researcher in life sciences. Fei et al. [5] discusses the text mining application in cancer research as cancer is a malignant disease and biomedical text has large value for its diagnostic, treatment and prevention. The paper also provides resources for cancer text mining and gives the general workflow of text mining in cancer system biology. Ravi et al. [2] gives an efficient approach using NLP to get correct biomedical document from web by counting the number of biomedical terms in the documents with the help of UMLS ontology and finally ranking the documents according to them. Wang et al. [11] proposed a similarity method for medical literature searches the normalised MEDLINE distance based on the biomedical literature knowledge source MEDLINE. This model is to refine the search process using user interests. User interests are analyzed to calculate semantic similarity among interest terms. A new text mining algorithm to increase the performance of information retrieval system for Medline and Pubmed has been proposed by Sagar et al. [6]. The unique term are weighted by using novel global relevant weighing schema. The biomedical text retrieval is the techniques applied to biomedical resources to extract information required as published biomedical research is very large. Sumit et al. [9] proposed a technique of soft clustering data mining algorithm to increase the accuracy of biomedical text extraction by discovering the predictive relationships between the different pieces of extracted data. Ontology is like dictionary or glossary but with more details like properties and interrelationships of the entities that exists for a particular domain. Tan et al. [12] showed that text mining of biomedical documents require substantial amount of domain knowledge. The paper proposes a model for choosing the most appropriate ontology for particular application. 3. Background 3.1 K-means Clustering Algorithm K-means is a widely used popular partitional and unsupervised technique used for document clustering due to its simplicity [21]. K-means algorithm uses an iterative process in order to cluster database. If the algorithm is required to produce K clusters then there will be K initial means and K final means. If K is the desired number of clusters, then partitional approaches typically find all K clusters at once. K-means is based on the idea that a center point can represent a cluster. K-means algorithm involves the following steps: Input: Collection of HTML documents. Output: Clusters of documents. Algorithm 1)Load all documents. 2)Remove stopwords and html tags. 3)Set the value of k. 4)Select k points at random as cluster centers. 5)Calculate the cosine similarity based on TF/IDF for each documents. 6)Assign documents to their closest cluster center according to the cosine similarity. 7)Recalculate the new cluster center. 8)If no document was reassigned then stop, otherwise repeat from step 6. 3.2 UMLS UMLS stands for the Unified Medical Language System is a collection of many controlled vocabularies in biomedical sciences [23]. UMLS is very large and contains terms related to biomedical and health. It is a wide range of general and specialized biomedical terminologies. It is used in variety of purposes like Information retrieval, Public Health Reporting, Clinical Health and Research. UMLS further provides facilities for natural language processing. It is intended to be used mainly by developers of system in medical informatics. The number of biomedical resources available to researchers is very large. Often the problem arises when the medical literature is searched as large volume of documents retrieved. The purpose of the UMLS is to enhance access to this literature by facilitating the development of computer systems that understand biomedical language. 3.3 Vector Space Model In this work documents are represented using the vector- space model (VSM) for clustering algorithm. VSM is most widely used algebraic model for representing text documents as vector of identifiers. In this model, each document, d, is considered to be a vector, d, in the term-space. Each document is represented by the (TF) vector, (tf1, tf2,…, tfn), where tfi is the frequency of the ith term in the document. In addition, we use the version of this model that weights each term based on its inverse document frequency (IDF) in the document collection. The weight associated with each keyword determines the relevance of the keyword in the document. The similarity between two documents can be measured in various ways. There are a number of possible measures for computing the similarity between the documents. The most common one is the cosine measure, which is defined as: Cosine-similarity (d1, d2) = (d1 • d2) / ||d1|| ||d2|| Paper ID: NOV161318 1397
  • 3. International Journal of Science and Research (IJSR) ISSN (Online): 2319-7064 Index Copernicus Value (2013): 6.14 | Impact Factor (2014): 5.611 Volume 5 Issue 2, February 2016 www.ijsr.net Licensed Under Creative Commons Attribution CC BY Where • indicates the vector dot product and ||d|| is the length of vector d. The strength of the similarity depends on the value of Ɵ. For K-means clustering, the cosine measure is used to compute which document centroid is closest to a given document. 4. Proposed Model Following is our proposed system of mining the biomedical documents. Briefly describing the various phases as follows: 4.1 Document Collection Firstly collect several number of HTML documents using search engines. Then the HTML documents are converted to simple text documents .Text Documents are used to perform the further process. 4.2 Documents Preprocessing The collected documents are preprocessed to perform the effective documents search and to improve the efficiency of the system. There are abundant unnecessary texts on web documents like articles, be verbs, preposition, conjunction, and punctuations etc, which are categorized as stop words. There are large quantities of stop words appear in the document and typically have no use in this research. The stopword are removed and the documents are simplified. Figure1: Proposed System Arcitecture 4.3 Classification of Documents Here in this phase the documents are classified into two categories i.e. biomedical documents and Non-biomedical Documents. This classification is done to extract the biomedical documents from all the documents. Non- biomedical documents are discarded. The classification of documents is done with the help of the UMLS. 4.4 Clustering of the Documents The next step is to make cluster of similar and related documents. The improved K-means algorithm is used to make the clusters of biomedical documents. 5. Experimental Results To obtain the similar and related biomedical documents we have collected approximately 100 web documents using Google search engine. After preprocessing, documents are classified into biomedical documents and non-biomedical documents as shown in fig.2. Figure 2: Classification of Documents Non biomedical documents are discarded while Clusters of biomedical documents using K-means clustering. We tested the proposed system for both original K-means algorithm and improved K-means algorithm. We found that our improved approach takes less execution time. The comparisons of results are shown in fig.3. Figure 3: Showing Comparison of Execution time Three performance measures precision, recall and F-measure have been calculated for both the approaches. The tested results of performance measures for demonstration are shown in table 1. The use of this is to obtain the similar documents in a group. So that finding the relevant information is not time consuming. Paper ID: NOV161318 1398
  • 4. International Journal of Science and Research (IJSR) ISSN (Online): 2319-7064 Index Copernicus Value (2013): 6.14 | Impact Factor (2014): 5.611 Volume 5 Issue 2, February 2016 www.ijsr.net Licensed Under Creative Commons Attribution CC BY Table 1: Comparison of Performance Measures 6. Conclusion The main aspect is retrieving the related biomedical documents in same group so that there is no need to go through all the documents. This paper presents improve the K-means algorithm so that it can perform clustering of biomedical documents in reasonable durations. It has been observed that selection of all words also affects the convergence time of K-means algorithm. The experimental results show that the execution time of the improved algorithm has become less or approximately half of the run time as in original. This shows a technique for selecting better initial points than random ones. We found that our approach is giving better performance results. This work can be further extended by ranking the documents in each cluster parallelization of K-means algorithm has been proposed as an improvement for the algorithm. In future we will consider the hierarchy of the biomedical documents in each cluster which will help the user to find all his documents in an organised and manner. References [1] Ravi Shankar Shukla, Kamendra Singh Yadav, Syed Tarif Rizvi, and Faisal Haseen, ”An Efficient Mining of Biomedical Data from Hypertext Documents via NLP,” Proc. of the 3rd Intell. Comput. (FICTA), Vol.1, pp. 651- 658, 2014. [2] Chung-Chi Huang and Zhiyong Lu, “Community challenges in biomedical text mining over 10 years: success, failure and the future,” Briefings in Bioinformatics Advance Access, May 1, 2015. [3] Dr. Shilpa Dang and Peerzada Hamid Ahmad,” A Review of Text Mining Techniques Associated with Various Application Areas,” International Journal of Science and Research (IJSR), Volume 4, no. 2, pp. 2461-2466, February, 2015. [4] Pooja Devi1, Ashlesha Gupta, and Ashutosh Dixit,” Comparative Study of HITS and PageRank Link based Ranking Algorithms,” International Journal of Advanced Research in Computer and Communication Engineering, Vol. 3, no. 2, February, 2014. [5] Fei Zhu , Preecha Patumcharoenpol , Cheng Zhang , Yang Yang , Jonathan Chan , Asawin Meechai , Wanwipa Vongsangnak , and Bairong Shen, “ Biomedical text mining and its applications in cancer research,” Journal of Biomedical Informatics, Vol. 46, no. 2, pp. 200–211, 2013. [6] S. Sagar Imambi and T. Sudha, “Extraction of Biomedical Information from MEDLINE Documents – A Text Mining Approach,” International Journal of Science, Environment and Technology, Vol. 2, no. 2, pp. 267 – 274, April, 2013. [7] Shally and Rejimol Robinson, “Survey for Mining Biomedical data from HTTP Documents,” International journal of Engineering Sciences & Research Technology, pp.165-169, Feburary, 2013. [8] Hui Yang and Yan Dong,” Recognizing Hierarchically Related Biomedical Entities Using MeSH-Based Mapping,” Tsinghua Science and Technology, Vol. 17, no. 6, pp. 609-618 December, 2012. [9] Sumit Vashishta and Dr. Yogendra Kumar Jain,” Efficient Retrieval of Text for Biomedical Domain using Data Mining Algorithm,” International Journal of Advanced Computer Science and Applications (IJACSA), Vol. 2, No. 4, pp. 77-80, 2011. [10]Sazia Salah uddin and Rashedur M Rahman, “Mining Biomedical Data from Hypertext Documents, “ Proc. of 14th Int. Conf. on Computer and Information Technology , ICCIT 2011, pp. 417-422, Dhaka, Bangladesh , December 22-24, 2011. [11]WANG Yan, WANG Cong, ZENG Yi, HUANG Zhisheng, Vassil Momtchev, Bo Andersson, REN Xu ,and ZHONG Ning,” Normalized MEDLINE Distance in Context-Aware Life Science Literature Searches”, Tsinghua Science and Technology, Vol. 15, no. 6, lpp.709-715, December, 2010. [12]He Tan and Patrick Lambrix, “Selecting Ontology for Biomedical Text Mining,” Workshop on BioNLP, pp. 55–62, Boulder, Colorado, June 2009, Association for Computational Linguistics, 2009. [13]Saurav Sahay, Sougata Mukhaerjea, Eugene Agichtein, Ernest V. Garcia , Shamkant B. Navathe, and Ashwin Ram,” Discovering Semantic Biomedical Relations Utilizing the Web,” ACM-Transaction on knowledge Discovery from Data (TKDD), Vol. 2, no. 1, March, 2008. [14]SougataMukherjea and Saurav Sahay, “Discovering Biomedical Relations Utilizing the World-Wide Web,” Pacific Symposium on Biocomputing 11:164- 175(2006). [15]Rainer Malik and Arno Siebes, “CONAN: An Integrative System for Biomedical Literature Mining,” Progress in Artificial Intelligence, 12th Portuguese Conference on Artificial Intelligence, EPIA 2005, pp. 248 – 259, Dec. 5-8, 2005. [16]Miguel Gomes da Costa Junior and Zhiguo Gong,” Web Structure Mining: An Introduction,” Proc. of the 2005 IEEE Int. Conf. on Information Acquisition, pp. 590- 595, Hong Kong and Macau, China, june 27-july 3, 2005. [17]Wempu Xing and Ali Ghorbani, “Weighted PageRank Algorithm,” Proc. of the second Annual Conference on Communication Networks and Services Research (CNSR’04), May 19-21, 2004. [18]Brigitte Mathiak and Silke Eckstein, ”Five Steps to Text Mining in Biomedical Literature,” Proc. of the Second European Workshop on Data Mining and Text Mining in Bioimformatics, pp. 47-50, Pisa, Italy, September 24, 2004. Paper ID: NOV161318 1399
  • 5. International Journal of Science and Research (IJSR) ISSN (Online): 2319-7064 Index Copernicus Value (2013): 6.14 | Impact Factor (2014): 5.611 Volume 5 Issue 2, February 2016 www.ijsr.net Licensed Under Creative Commons Attribution CC BY [19]Lawrence and Brin, Sergey and Motwani, Rajeev and Winograd, Terry, "The Page Rank Citation Ranking: Bringing Order to the Web,” Technical Report. Stanford InfoLab, 1999. [20]Jon M. Kleinberg, "Authoritative sources in a hyperlinked environment,” J. ACM, Vol. 46, no. 5, pp. 604 -632, 1999. [21]K. A. Abdul Nazeer and M. P. Sebastian, “Improving the Accuracy and Efficiency of the k-means Clustering Algorithm,” Proceedings of the World Congress on Engineering, WCE 2009, Vol.1, pp.308-312, London, U.K., July 1 - 3, 2009. [22]H. Shatkay and R. Feldman, "Mining the Biomedical Literature in the Genomic Era: An Overview,” Journal of Computational Biology (JCB), Vo.10, Issue: 6, pp. 821-855, July 5, 2004. [23]Keith E. Cambell, Diane E. Oliver, and Edward H. Shortlife, “The Unified Medical Language System: Toward a collaborative Approach for Solving Terminologic problems,” Journal of the American Medical Informatics Association, Vol.5, no.1, Jan / Feb, 1998. [24]Https://www.nlm.nih.gov/research/umls/knowledge_sou rces/. Author Profile Nikita Gupta is a research scholar pursuing M.Tech in Computer Science & Engineering from JSS Academy of Technical Education, Noida, U.P., India. She has completed her B.tech in CSE from Mangalayatan University, Aligarh, U.P., India in 2012. Paper ID: NOV161318 1400