The biomedical research literature is one of many domains that conceals precious knowledge, and
the biomedical community makes extensive use of this scientific literature to discover facts about
biomedical entities such as diseases, drugs, etc. MEDLINE is a huge database of biomedical research
papers that remains a significantly underutilized source of biological information. Discovering useful
knowledge in such a huge corpus raises various problems related to the type of information sought, such as
the concepts belonging to the domain of the texts and the semantic relationships associated with them. In this paper,
we propose a two-level model for self-supervised relation extraction from MEDLINE using the Unified
Medical Language System (UMLS) knowledge base. The model takes a self-supervised approach to
relation extraction (RE) by constructing enhanced training examples using information from UMLS. The
model shows better results than current state-of-the-art and naïve approaches.
A Semantic Retrieval System for Extracting Relationships from Biological Corpus (ijcsit)
The World Wide Web holds a large amount of diverse information. When searching the Web, users do not always obtain the type of information they expect. Within information extraction, extracting semantic relationships between terms in documents remains a challenge. This
paper proposes a system that retrieves documents based on query expansion and tackles the extraction of semantic relationships from biological documents. The system retrieves documents relevant to the input terms and then detects whether a relationship exists. It uses the Boolean
model together with pattern recognition to determine the relevant documents and locate the relationship within a biological document. The system constructs a term-relation table that accelerates the relation-extraction step. The proposed method also offers another use of the system:
researchers can use it to determine the relationship between two biological terms from the information available in the biological documents. For the retrieved documents, the system also measures precision and recall.
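The precision and recall measurement mentioned above reduces to a simple set computation over the retrieved and relevant document sets. The sketch below is illustrative; the function name and document identifiers are not from the paper:

```python
def precision_recall(retrieved, relevant):
    """Compute precision and recall for a retrieved document set.

    precision = |retrieved AND relevant| / |retrieved|
    recall    = |retrieved AND relevant| / |relevant|
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Example: 4 documents retrieved, 3 of them relevant, out of 5 relevant overall.
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d2", "d3", "d7", "d9"])
```

Here precision is 3/4 and recall is 3/5 for the example inputs.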
ONTOLOGY-DRIVEN INFORMATION RETRIEVAL FOR HEALTHCARE INFORMATION SYSTEM: ... (IJNSA Journal)
In health research, one of the major tasks is to retrieve and analyze heterogeneous databases containing
a single patient's information gathered from a large volume of data over a long period of time. The
main objective of this paper is to present our ontology-based information retrieval approach for a
clinical information system. We have performed a case study in a real-life hospital setting. The results
obtained illustrate the feasibility of the proposed approach, which significantly improved the information
retrieval process on a large volume of data collected over a long period, from August 2011 until January
2012.
Biomedical indexing and retrieval system based on language modeling approach (ijseajournal)
This summarizes a research paper that proposes a biomedical indexing and retrieval system called BIOINSY. It uses a language modeling approach to select the best Medical Subject Headings (MeSH) descriptors to index medical articles from sources like PUBMED. The system first preprocesses articles by splitting text, stemming words, and removing stop words. It then extracts terms using a hybrid linguistic and statistical approach. Terms are weighted based on semantic relationships in MeSH, not just statistics. Descriptors are selected by disambiguating terms and estimating the probability a descriptor was generated by the article's language model. Experiments showed the effectiveness of this conceptual indexing approach.
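The descriptor-scoring step described above can be sketched as a unigram language model with simple additive smoothing. This is a minimal illustration, not BIOINSY's actual estimator; the function names and example terms are assumptions:

```python
from collections import Counter

def lm_score(descriptor_terms, article_tokens, vocab_size, mu=1.0):
    """Score a MeSH-style descriptor by the probability that the article's
    unigram language model generates its terms (additively smoothed MLE)."""
    counts = Counter(article_tokens)
    total = len(article_tokens)
    score = 1.0
    for term in descriptor_terms:
        score *= (counts[term] + mu) / (total + mu * vocab_size)
    return score

article = "gene expression in alzheimer disease gene pathways".split()
# A descriptor whose terms occur in the article should outscore one that doesn't.
s1 = lm_score(["gene", "expression"], article, vocab_size=50)
s2 = lm_score(["heart", "surgery"], article, vocab_size=50)
```

A real system would additionally weight terms by their MeSH semantic relationships, as the abstract notes.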
Domain ontology development for communicable diseases (csandit)
This document discusses the development of a domain ontology for communicable diseases. The researchers developed an ontology with concepts like diseases, symptoms, and causes arranged in a taxonomy. They created over 600 concepts with properties and relations. The ontology development process included specification, conceptualization, creation of instances, and evaluation using a description logic reasoner to verify the concepts and relations were correctly represented. The ontology will be expanded to include more diseases and connections to related web content to provide information retrieval.
DOMAIN ONTOLOGY DEVELOPMENT FOR COMMUNICABLE DISEASES (cscpconf)
The Web has become the very first resource to search for any kind of information. With the emergence of the semantic web, our search queries have started generating more informed results. Ontologies are at the core of any semantic web application. They help in the rapid development of
distributed systems by providing information on the fly. This key feature of distributing and
sharing information has made ontologies a new knowledge representation mechanism, one
strongly backed by a sound inference system. In this paper, we discuss the development, verification and validation of an ontology in a health domain.
This document discusses using a genetic algorithm to improve search visibility by expanding user queries. It explains that genetic algorithms can be applied to information retrieval by representing candidate solutions as chromosomes, evaluating their fitness, and evolving new generations through selection, crossover and mutation. The paper presents previous work applying genetic algorithms for query expansion and relevance feedback. It then describes the experiment conducted to implement a genetic algorithm over 500 generations to select optimal keywords for expanding queries and evaluate the approach on sample query results.
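The selection-crossover-mutation loop described above can be sketched as a basic genetic algorithm over keyword subsets. This is a hedged illustration: the chromosome encoding, parameters, and the toy fitness function are assumptions, not the paper's setup:

```python
import random

def genetic_expand(candidates, fitness, generations=50, pop_size=20,
                   mutation_rate=0.1, seed=42):
    """Pick a subset of query-expansion keywords with a basic genetic algorithm.

    Chromosomes are bit vectors over `candidates`; `fitness` scores a keyword
    subset (e.g. by retrieval quality on sample query results).
    """
    rng = random.Random(seed)
    n = len(candidates)
    decode = lambda chrom: [k for k, bit in zip(candidates, chrom) if bit]
    pop = [[rng.randint(0, 1) for _ in range(n)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda c: fitness(decode(c)), reverse=True)
        survivors = pop[:pop_size // 2]                  # selection (elitist)
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = rng.sample(survivors, 2)
            cut = rng.randrange(1, n)                    # one-point crossover
            child = a[:cut] + b[cut:]
            for i in range(n):                           # per-bit mutation
                if rng.random() < mutation_rate:
                    child[i] = 1 - child[i]
            children.append(child)
        pop = survivors + children
    return decode(max(pop, key=lambda c: fitness(decode(c))))

# Toy fitness: reward keywords from a known-relevant vocabulary, with a small
# penalty per keyword to keep the expansion short (values are illustrative).
relevant = {"tumor", "oncology", "carcinoma"}
score = lambda kws: len(set(kws) & relevant) - 0.1 * len(kws)
best = genetic_expand(["tumor", "car", "oncology", "tree", "carcinoma"], score)
```

The paper runs 500 generations; the sketch defaults to 50 to keep the toy example fast.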
Predicting students' performance using ID3 and C4.5 classification algorithms (IJDKP)
An educational institution needs approximate prior knowledge of enrolled students to predict
their performance in future academics. This helps it identify promising students and also provides
an opportunity to pay attention to those who would probably get lower grades and help them improve. As a
solution, we have developed a system that can predict students' performance from their previous
performance using data mining techniques under classification. We have analyzed a data
set containing information about students, such as gender, marks scored in the board examinations of
classes X and XII, marks and rank in entrance examinations, and first-year results from the previous batch
of students. By applying the ID3 (Iterative Dichotomiser 3) and C4.5 classification algorithms to this data,
we have predicted the general and individual performance of freshly admitted students in future
examinations.
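The split criterion at the heart of ID3 is information gain: the entropy reduction achieved by partitioning the data on an attribute. The sketch below illustrates that computation on a toy attribute; the attribute and label names are illustrative, not the paper's dataset:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a sequence of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def information_gain(rows, labels, attr):
    """ID3 split criterion: entropy reduction from splitting on `attr`."""
    base = entropy(labels)
    remainder = 0.0
    for value in {row[attr] for row in rows}:
        subset = [lab for row, lab in zip(rows, labels) if row[attr] == value]
        remainder += len(subset) / len(labels) * entropy(subset)
    return base - remainder

# Toy data: does an entrance-exam rank band predict a pass/fail outcome?
rows = [{"rank": "high"}, {"rank": "high"}, {"rank": "low"}, {"rank": "low"}]
labels = ["pass", "pass", "fail", "fail"]
gain = information_gain(rows, labels, "rank")  # a perfect split: gain = 1.0 bit
```

ID3 greedily picks the attribute with the highest gain at each node; C4.5 refines this with gain ratio and handles continuous attributes and missing values.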
This document describes a study that developed an integrated biomedical ontology for extracting information from Medline abstracts about Alzheimer's disease. The ontology integrated the Gene Ontology and Medical Subject Headings by mapping gene names, GO terms, and MeSH keywords related to Alzheimer's. The integrated ontology was validated structurally, syntactically, and semantically. It was then used to discover significant associations between proteins, genes, and Alzheimer's disease extracted from Medline abstracts.
PERFORMANCE EVALUATION OF STRUCTURED AND SEMI-STRUCTURED BIOINFORMATICS TOOLS... (ijseajournal)
This document evaluates the performance of structured and semi-structured tools for accessing bioinformatics databases. It compares the Sequence Retrieval System (SRS) and Entrez search tools for structured data retrieval to Perl and BioPerl programs for semi-structured data retrieval. The study retrieves gene information from the European Bioinformatics Institute and National Centre for Biotechnology Information databases using each method. It finds that semi-structured tools provide an alternative to structured tools, though each approach has advantages and disadvantages for certain types of queries.
Indexing based Genetic Programming Approach to Record Deduplication (idescitation)
In this paper, we present a genetic programming (GP) approach to record
deduplication with indexing techniques. Data deduplication is a process in which data are
cleaned of duplicate records caused by misspellings, field swaps, or other mistakes and
inconsistencies. This process requires identifying objects that appear in more than
one list. Detecting and eliminating duplicated data is one of the major
problems in the broad area of data cleaning and data quality in data warehouses, so we
need an algorithm that can detect and eliminate as many duplications as possible. GP
with indexing is an optimization technique that helps find the maximum number of duplicates in
the database. We used a deduplication function that can identify whether two or more
entries in a repository are replicas. Many industries and systems depend on the
accuracy and reliability of databases to carry out operations, so the quality of the
information stored in databases can have significant cost implications for a system that
relies on that information to function and conduct business. Moreover, clean and
replica-free repositories not only allow the retrieval of higher-quality information but also
lead to more concise data and to potential savings in computational time and resources to
process this data.
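A deduplication function of the kind described can be sketched with a simple field-wise similarity measure. This is a stand-in for the evolved GP function, not the paper's method; the field names, threshold, and Jaccard choice are all assumptions:

```python
def jaccard(a, b):
    """Token-set Jaccard similarity between two record field strings."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

def is_replica(rec1, rec2, fields=("name", "address"), threshold=0.7):
    """Flag two records as probable replicas when their average field
    similarity exceeds a threshold (stand-in for an evolved GP function)."""
    sims = [jaccard(rec1.get(f, ""), rec2.get(f, "")) for f in fields]
    return sum(sims) / len(sims) >= threshold

r1 = {"name": "John A Smith", "address": "12 Oak Street"}
r2 = {"name": "John Smith", "address": "12 Oak Street"}
r3 = {"name": "Mary Jones", "address": "99 Elm Road"}
dup = is_replica(r1, r2)      # near-identical records exceed the threshold
not_dup = is_replica(r1, r3)  # unrelated records fall well below it
```

Indexing (blocking) would restrict which record pairs this function is ever called on, avoiding the quadratic all-pairs comparison.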
Evaluating the efficiency of rule techniques for file classification (eSAT Journals)
Abstract Text mining refers to the process of deriving high-quality information from text. Also known as knowledge discovery from text (KDT), it deals with the machine-supported analysis of text. It is used in various areas such as information retrieval, marketing, information extraction, natural language processing, document similarity, and so on. Document similarity is one of the important techniques in text mining, and its first and foremost step is to classify files based on their category. In this research work, various classification rule techniques are used to classify computer files based on their extensions; for example, the extension of a computer file may be pdf, doc, ppt, xls, and so on. There are several rule-classifier algorithms, such as decision table, JRip, Ridor, DTNB, NNge, PART, OneR and ZeroR. In this research work, three classification algorithms, namely the decision table, DTNB and OneR classifiers, are used to classify computer files based on their extension. The results produced by these algorithms are analyzed using the performance factors of classification accuracy and error rate. From the experimental results, DTNB proves to be more efficient than the other two techniques. Index Terms: Data mining, Text mining, Classification, Decision table, DTNB, OneR
This document proposes a new method to re-rank web documents retrieved by search engines based on their relevance to a user's query using ontology concepts. It involves building an ontology of concepts for a given domain (electronic commerce), extracting concepts from retrieved documents, and re-ranking documents based on the frequency of ontology concepts within them. An evaluation showed the approach reduced average ranking error compared to search engines alone. The method was tested on the first 30 documents retrieved for the query "e-commerce" from search engines.
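The re-ranking criterion described above, ordering documents by the frequency of ontology concepts they contain, can be sketched directly. The concept list and documents below are illustrative, not the paper's e-commerce ontology:

```python
def rerank(documents, ontology_concepts):
    """Re-rank retrieved documents by how often ontology concepts occur in
    them, counting repeated mentions (a sketch of the frequency criterion)."""
    def score(doc):
        tokens = doc.lower().split()
        return sum(tokens.count(c) for c in ontology_concepts)
    return sorted(documents, key=score, reverse=True)

concepts = ["payment", "cart", "checkout"]
docs = ["weather report for monday",
        "online cart checkout and payment options",
        "secure payment gateways"]
ranked = rerank(docs, concepts)  # concept-rich documents rise to the top
```

A fuller version would match multi-word concepts and normalize by document length; the frequency count alone is the core idea the abstract describes.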
IJRET : International Journal of Research in Engineering and Technology is an international peer reviewed, online journal published by eSAT Publishing House for the enhancement of research in various disciplines of Engineering and Technology. The aim and scope of the journal is to provide an academic medium and an important reference for the advancement and dissemination of research results that support high-level learning, teaching and research in the fields of Engineering and Technology. We bring together Scientists, Academician, Field Engineers, Scholars and Students of related fields of Engineering and Technology.
The Statement of Conjunctive and Disjunctive Queries in Object Oriented Datab... (Editor IJCATR)
The entrance of object-oriented concepts into databases has caused relational databases to be gradually replaced by object-oriented
databases in various fields. On the other hand, several methods have been presented to handle the problem of uncertain real-world data.
One of these methods for modeling databases is an approach which couples object-oriented database modeling with fuzzy logic. Many
queries that users pose are expressed in terms of linguistic variables. Because classical databases are not able to support these
variables, fuzzy approaches are considered. In this study we investigate database queries in both simple and complex forms; in
the complex form, we use conjunctive and disjunctive queries. We then use XML labels to express queries in fuzzy form.
Entering the XML world, as the most reliable option, also lets us communicate with other sections of software. We also aim
to correct conjunctive and disjunctive queries over a fuzzy object-oriented database using the concepts of dependency measure and
weight, with weights assigned to different phrases of a query based on user emphasis. Another aim of this research is to map fuzzy
queries to fuzzy-XML. The query is expected to be simple to implement, and the output of executing queries should be much closer to users'
needs and expectations. The results show that the proposed method expresses the possible conjunctive and disjunctive queries of the
database in the form of fuzzy-XML.
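Conjunctive and disjunctive fuzzy queries are commonly evaluated with the standard Zadeh operators, min for AND and max for OR. The sketch below shows the combination step only; the predicates and membership degrees are illustrative, and the paper's weighting scheme is not modeled:

```python
# Fuzzy conjunction (AND) via min and disjunction (OR) via max: the standard
# Zadeh operators for combining membership degrees in [0, 1].
def fuzzy_and(*degrees):
    return min(degrees)

def fuzzy_or(*degrees):
    return max(degrees)

# Evaluating "cheap AND (new OR popular)" for one record, given the record's
# membership degree in each fuzzy predicate (values are made up):
cheap, new, popular = 0.8, 0.3, 0.6
degree = fuzzy_and(cheap, fuzzy_or(new, popular))  # min(0.8, max(0.3, 0.6))
```

The paper's weighted variant would scale each predicate's degree by a user-emphasis weight before combining.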
This document provides a listing and brief descriptions of working papers from 2000. It includes 12 papers with titles and short 1-2 paragraph summaries of each paper's topic or focus. The papers cover a range of topics related to text mining, machine learning, data compression, knowledge discovery, and user interfaces for developing classifiers.
GReAT model: a model for the automatic generation of semantic relations betwee... (ijcsity)
The large available amount of non-structured texts that belong to different domains such as healthcare (e.g. medical records), justice (e.g. laws, declarations), insurance (e.g. declarations), etc. increases the effort required for the analysis of information in a decision-making process. Different projects and tools have proposed strategies to reduce this complexity by classifying, summarizing or annotating the texts. Particularly, text summary strategies have proven to be very useful to provide a compact view of an original text. However, the available strategies to generate these summaries do not fit very well within domains that require taking into consideration the temporal dimension of the text (e.g. a recent piece of text in a medical record is more important than a previous one) and the profile of the person who requires the summary (e.g. the medical specialization). To cope with these limitations this paper presents "GReAT", a model for automatic summary generation that relies on natural language processing and text mining techniques to extract the most relevant information from narrative texts and discover new information from the detection of related information. The GReAT model was implemented in software to be validated in a health institution, where it has shown to be very useful to display a preview of the information in medical health records and discover new facts and hypotheses within the information. Several tests were executed, such as functionality, usability and performance tests of the implemented software. In addition, precision and recall measures were applied to the results obtained through the implemented tool, as well as to the loss of information incurred by providing a text shorter than the original.
Scaling Down Dimensions and Feature Extraction in Document Repository Classif... (ijdmtaiir)
In this study a comprehensive evaluation of two
supervised feature selection methods for dimensionality
reduction is performed: Latent Semantic Indexing (LSI) and
Principal Component Analysis (PCA). These are gauged against
unsupervised techniques such as fuzzy feature clustering using
fuzzy C-means (FCM). The main objective of the study is
to estimate the relative efficiency of the two supervised techniques
against unsupervised fuzzy techniques while reducing the
feature space. It is found that clustering using FCM leads to
better accuracy in classifying documents than
techniques like LSI and PCA. The results show that
clustering of features improves the accuracy of document
classification.
This document discusses the use of fuzzy queries to retrieve information from databases. Fuzzy queries allow for imprecise or vague terms to be used in queries, similar to natural language. The document first provides background on limitations of traditional database queries. It then discusses how fuzzy set theory and membership functions can be applied to queries and data to handle uncertain terms. The proposed approach applies fuzzy queries to a relational database, defining linguistic variables and membership functions. This allows information to be retrieved based on fuzzy criteria and improves the ability to query databases using human-like terms. Benefits of fuzzy queries include more natural interaction and accounting for real-world data imperfections.
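The membership functions described above map a crisp attribute value to a degree in [0, 1], which is what lets a fuzzy query use vague terms like "young". The trapezoidal shape and the age ranges below are illustrative assumptions, not taken from the document:

```python
def trapezoid(x, a, b, c, d):
    """Trapezoidal membership function: 0 at or below a, rising to 1 on
    [b, c], falling back to 0 at or above d."""
    if x <= a or x >= d:
        return 0.0
    if b <= x <= c:
        return 1.0
    if x < b:
        return (x - a) / (b - a)
    return (d - x) / (d - c)

# Linguistic variable "young" over an age attribute (ranges are illustrative).
young = lambda age: trapezoid(age, 0, 0, 25, 40)

# Fuzzy query: retrieve people who are "young" to at least degree 0.5.
people = [("Ana", 22), ("Ben", 30), ("Cara", 45)]
result = [(name, young(age)) for name, age in people if young(age) >= 0.5]
```

A 30-year-old matches partially (degree 2/3 here) rather than being excluded outright, which is the "human-like terms" benefit the document describes.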
A survey on one class clustering (Iaetsd)
This document presents a new method for performing one-to-many data linkage called the One Class Clustering Tree (OCCT). The OCCT builds a tree structure with inner nodes representing features of the first dataset and leaves representing similar features of the second dataset. It uses splitting criteria and pruning methods to perform the data linkage more accurately than existing indexing techniques. The OCCT approach induces a decision tree using a splitting criteria and performs prepruning to determine which branches to trim. It then compares entities to match them between the two datasets and produces a final result.
NLP based retrieval of medical information for diagnosis of human diseases (eSAT Publishing House)
This document describes a proposed natural language processing (NLP) system to retrieve medical information from clinical documents for disease diagnosis. The system would use NLP techniques like named entity recognition, part-of-speech tagging, and relationship extraction to process both clinical documents and user queries. For queries asking for disease information, the system would retrieve and score relevant documents, then output disease information. For queries describing symptoms, the system would attempt to output the corresponding disease name. The system would be implemented using modules for data extraction, processing, query analysis, document retrieval and scoring, and output filtering.
Inference Networks for Molecular Database Similarity Searching (CSCJournals)
Molecular similarity searching is the process of finding chemical compounds that are similar to a target compound. The concept of molecular similarity plays an important role in modern computer-aided drug design methods and has been successfully applied in the optimization of lead series. It is used for chemical database searching and the design of combinatorial libraries. In this paper, we explore the possibility and effectiveness of using a Bayesian inference network for similarity searching. The topology of the network represents the dependence relationships between molecular descriptors and molecules, as well as the quantitative knowledge of probabilities encoding the strength of these relationships, mined from our compound collection. The retrieval of an active compound for a given target structure is obtained by means of an inference process through a network of dependences. The new approach is tested by its ability to retrieve seven sets of active molecules seeded in the MDDR. Our empirical results suggest that similarity methods based on Bayesian networks provide a promising and encouraging alternative to existing similarity searching methods.
International Journal of Computational Engineering Research (IJCER) (ijceronline)
International Journal of Computational Engineering Research (IJCER) is an international, English-language, monthly online journal. The journal publishes original research work that contributes significantly to furthering scientific knowledge in engineering and technology.
Ontology oriented concept based clustering (eSAT Journals)
Abstract Worldwide, health-centre scientists, physicians and patients are accessing, analyzing, integrating and storing massive amounts of digital medical data in different databases. The potential for retrieval of information is vast and daunting. The objective of our approach is to separate relevant information from irrelevant information through user-friendly and efficient search algorithms. The traditional solution employs keyword-based search without semantic consideration, so keyword retrieval may return inaccurate and incomplete results. To overcome the problem of information retrieval from this huge amount of data, a concept-based clustering method over an ontology is needed. In the proposed method, WordNet is integrated to match synonyms for the identified keywords so as to obtain accurate information, and concept-based clustering is developed using the k-means algorithm in accordance with the principles of ontology, so that the importance of the words of a cluster can be identified. Keywords: Ontology, Concept based clustering, K-means algorithm and information retrieval.
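The clustering step in the approach above rests on plain k-means. The sketch below shows that core loop on 2-D points; in the paper the points would be concept vectors derived after WordNet synonym matching, which is out of scope here, and the data is invented:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means over 2-D points with Euclidean distance: assign each
    point to its nearest center, then recompute centers as cluster means."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    clusters = []
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k),
                      key=lambda i: (p[0] - centers[i][0]) ** 2
                                    + (p[1] - centers[i][1]) ** 2)
            clusters[idx].append(p)
        centers = [(sum(p[0] for p in c) / len(c),
                    sum(p[1] for p in c) / len(c)) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return clusters

# Two well-separated groups of three points each.
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
clusters = kmeans(pts, k=2)
```

On well-separated data like this the loop converges to the two natural groups within a few iterations regardless of which points seed the centers.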
IJCER (www.ijceronline.com) International Journal of computational Engineerin...ijceronline
This document summarizes a research paper that proposes a machine learning approach to identify disease-treatment relationships from biomedical text. It extracts sentences mentioning diseases and treatments from medical publications and classifies the semantic relationships between them. The researchers evaluate their methodology on a dataset of sentences annotated with diseases, treatments and their relationships. Their results show the machine learning models can reliably extract this information and outperform previous methods on the same data. The proposed approach could be integrated into applications to disseminate healthcare information from published literature to medical professionals and patients.
Nlp based retrieval of medical information for diagnosis of human diseaseseSAT Journals
Abstract NLP Based Retrieval of Medical Information is the extraction of medical data from narrative clinical documents. In this paper, we provide the way to diagnose diseases with the help of natural language interpretation and classification techniques. However extraction of medical information is difficult task due to complex symptom names and complex disease names. For diagnosis we will be using two approaches, one is getting disease names with the help of classifiers and another way is using the patterns with the help of NLP for getting the information related to diseases. These both approaches will be applied according to the question type. Keywords: NLP, narrative text, extraction, medical information, expert system
PERFORMANCE EVALUATION OF STRUCTURED AND SEMI-STRUCTURED BIOINFORMATICS TOOLS...ijseajournal
This document evaluates the performance of structured and semi-structured tools for accessing bioinformatics databases. It compares the Sequence Retrieval System (SRS) and Entrez search tools for structured data retrieval to Perl and BioPerl programs for semi-structured data retrieval. The study retrieves gene information from the European Bioinformatics Institute and National Centre for Biotechnology Information databases using each method. It finds that semi-structured tools provide an alternative to structured tools, though each approach has advantages and disadvantages for certain types of queries.
Indexing based Genetic Programming Approach to Record Deduplicationidescitation
In this paper, we present a genetic programming (GP) approach to record deduplication with indexing techniques. Data deduplication is a process in which data are cleaned of duplicate records caused by misspellings, field swaps or other mistakes and data inconsistencies. This process requires identifying objects that are included in more than one list. The problem of detecting and eliminating duplicated data is one of the major problems in the broad area of data cleaning and data quality in data warehouses, so we need an algorithm that can detect and eliminate as many duplications as possible. GP with indexing is an optimization technique that helps to find the maximum number of duplicates in the database. We use a deduplication function that is able to identify whether two or more entries in a repository are replicas or not. Many industries and systems depend on the accuracy and reliability of databases to carry out operations; therefore, the quality of the information stored in databases can have significant cost implications for a system that relies on that information to function and conduct business. Moreover, clean and replica-free repositories not only allow the retrieval of higher-quality information but also lead to more concise data and to potential savings in computational time and resources to process this data.
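The two core pieces of the approach, an indexing step that limits pairwise comparisons to small blocks and a deduplication function combining per-field similarities, can be sketched as follows. The fixed weights stand in for the combination that GP would evolve, and the records are toy data:

```python
from collections import defaultdict

def jaccard(a, b):
    # token-level Jaccard similarity between two field values
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

def dedup_score(r1, r2, weights=(0.6, 0.4)):
    # weighted combination of per-field similarities; in the paper
    # this function is evolved by GP, the weights here are placeholders
    return weights[0] * jaccard(r1["name"], r2["name"]) + \
           weights[1] * jaccard(r1["city"], r2["city"])

def find_replicas(records, threshold=0.8):
    # indexing step: only records sharing a blocking key are compared
    blocks = defaultdict(list)
    for i, r in enumerate(records):
        blocks[r["name"][:1].lower()].append(i)
    pairs = []
    for ids in blocks.values():
        for x in range(len(ids)):
            for y in range(x + 1, len(ids)):
                if dedup_score(records[ids[x]], records[ids[y]]) >= threshold:
                    pairs.append((ids[x], ids[y]))
    return pairs
```

Blocking is what makes the approach scale: without it every record would be compared against every other record.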
Evaluating the efficiency of rule techniques for file classificationeSAT Journals
Abstract Text mining refers to the process of deriving high-quality information from text. Also known as knowledge discovery from text (KDT), it deals with the machine-supported analysis of text and is used in various areas such as information retrieval, marketing, information extraction, natural language processing, document similarity, and so on. Document similarity is one of the important techniques in text mining; its first and foremost step is to classify files based on their category. In this research work, various classification-rule techniques are used to classify computer files based on their extensions, for example pdf, doc, ppt, xls, and so on. There are several rule-classifier algorithms, such as decision table, JRip, Ridor, DTNB, NNge, PART, OneR and ZeroR. In this research work, three classification algorithms, namely decision table, DTNB and OneR, are used to classify computer files based on their extension. The results produced by these algorithms are analyzed using the performance factors classification accuracy and error rate. From the experimental results, DTNB proves to be more efficient than the other two techniques. Index Terms: Data mining, Text mining, Classification, Decision table, DTNB, OneR
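Of the rule classifiers named above, OneR is the simplest to illustrate: it builds a one-level rule for each attribute (attribute value maps to majority class) and keeps the attribute whose rule makes the fewest training errors. A minimal sketch on toy file records, not the paper's dataset:

```python
from collections import Counter, defaultdict

def one_r(rows, attributes, target):
    # OneR: one rule per attribute, keep the attribute with fewest errors
    best = None
    for attr in attributes:
        by_value = defaultdict(Counter)
        for row in rows:
            by_value[row[attr]][row[target]] += 1
        # each attribute value predicts its majority class
        rule = {v: c.most_common(1)[0][0] for v, c in by_value.items()}
        errors = sum(sum(c.values()) - c[rule[v]] for v, c in by_value.items())
        if best is None or errors < best[1]:
            best = (attr, errors, rule)
    return best[0], best[2]
```

On data where the extension determines the class, OneR correctly selects the extension attribute over a noisy one.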
This document proposes a new method to re-rank web documents retrieved by search engines based on their relevance to a user's query using ontology concepts. It involves building an ontology of concepts for a given domain (electronic commerce), extracting concepts from retrieved documents, and re-ranking documents based on the frequency of ontology concepts within them. An evaluation showed the approach reduced average ranking error compared to search engines alone. The method was tested on the first 30 documents retrieved for the query "e-commerce" from search engines.
The Statement of Conjunctive and Disjunctive Queries in Object Oriented Datab...Editor IJCATR
The entrance of object-oriented concepts into databases has caused relational databases to be gradually replaced with object-oriented databases in various fields. On the other hand, several methods have been presented for handling the uncertain data of the real world. One such database-modeling approach couples object-oriented database modeling with fuzzy logic. Many queries that users pose are expressed in terms of linguistic variables; because classical databases are not able to support these variables, fuzzy approaches are considered. In this study we investigate database queries in both simple and complex forms; in the complex form, we use conjunctive and disjunctive queries. We then use XML labels to express the queries in fuzzy form; entering the XML world, as a highly reliable interchange format, also lets the system communicate with other parts of the software. We also refine conjunctive and disjunctive queries over a fuzzy object-oriented database using the concepts of dependency measure and weight, where weights are assigned to different phrases of a query based on user emphasis. A further aim of this research is mapping fuzzy queries to fuzzy-XML. The queries are expected to be simple to implement, and the output of executing them to be much closer to users' needs and expectations. The results show that the proposed method expresses the possible conjunctive and disjunctive queries of the database in the form of Fuzzy-XML.
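The fuzzy query machinery described above can be sketched concretely: crisp column values are mapped to membership degrees by linguistic terms, a conjunction takes the minimum and a disjunction the maximum of the degrees, and a per-term weight reflects the user's emphasis. The membership functions, weights and row below are illustrative assumptions, not the paper's definitions:

```python
def young(age):
    # trapezoidal membership for the linguistic term "young"
    if age <= 25: return 1.0
    if age >= 40: return 0.0
    return (40 - age) / 15

def high_salary(s):
    # ramp membership for the linguistic term "high salary"
    if s >= 80000: return 1.0
    if s <= 40000: return 0.0
    return (s - 40000) / 40000

def weighted(mu, w):
    # importance weighting: max(mu, 1 - w); w = 1 keeps mu unchanged,
    # w = 0 makes the term irrelevant
    return max(mu, 1.0 - w)

def conjunctive(row, weights=(1.0, 0.5)):
    # fuzzy AND = min of weighted membership degrees
    return min(weighted(young(row["age"]), weights[0]),
               weighted(high_salary(row["salary"]), weights[1]))

def disjunctive(row):
    # fuzzy OR = max of membership degrees
    return max(young(row["age"]), high_salary(row["salary"]))
```

Rows can then be ranked by their degree of satisfaction instead of being filtered by a crisp yes/no condition.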
This document provides a listing and brief descriptions of working papers from 2000. It includes 12 papers with titles and short 1-2 paragraph summaries of each paper's topic or focus. The papers cover a range of topics related to text mining, machine learning, data compression, knowledge discovery, and user interfaces for developing classifiers.
Great model a model for the automatic generation of semantic relations betwee...ijcsity
The large available amount of non-structured texts that belong to different domains such as healthcare (e.g. medical records), justice (e.g. laws, declarations), insurance (e.g. declarations), etc. increases the effort required for the analysis of information in a decision-making process. Different projects and tools have proposed strategies to reduce this complexity by classifying, summarizing or annotating the texts. Particularly, text summary strategies have proven to be very useful to provide a compact view of an original text. However, the available strategies to generate these summaries do not fit very well within domains that require taking into consideration the temporal dimension of the text (e.g. a recent piece of text in a medical record is more important than a previous one) and the profile of the person who requires the summary (e.g. the medical specialization). To cope with these limitations, this paper presents "GReAT", a model for automatic summary generation that relies on natural language processing and text mining techniques to extract the most relevant information from narrative texts and discover new information from the detection of related information. The GReAT model was implemented as software to be validated in a health institution, where it has shown to be very useful to display a preview of the information in medical health records and to discover new facts and hypotheses within that information. Several tests were executed on the implemented software, covering Functionality, Usability and Performance. In addition, precision and recall measures were applied to the results obtained through the implemented tool, as well as to the loss of information incurred by providing a text shorter than the original.
Scaling Down Dimensions and Feature Extraction in Document Repository Classif...ijdmtaiir
In this study a comprehensive evaluation of two supervised feature-selection methods for dimensionality reduction is performed: Latent Semantic Indexing (LSI) and Principal Component Analysis (PCA). These are gauged against unsupervised techniques such as fuzzy feature clustering using fuzzy C-means (FCM). The main objective of the study is to estimate the relative efficiency of the two supervised techniques against the unsupervised fuzzy technique while reducing the feature space. It is found that clustering using FCM leads to better accuracy in classifying documents than LSI and PCA. The results show that the clustering of features improves the accuracy of document classification.
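The FCM step referred to above can be sketched on one-dimensional toy feature scores: unlike hard k-means, every document keeps a graded membership in every cluster. This is a minimal sketch of the standard FCM updates, not the paper's implementation:

```python
def fcm(xs, centroids, m=2.0, iters=25):
    # fuzzy C-means on 1-D data; m > 1 is the fuzzifier
    c = len(centroids)
    for _ in range(iters):
        # membership update: u_ik = 1 / sum_j (d_ik / d_jk)^(2/(m-1))
        u = []
        for x in xs:
            d = [abs(x - v) + 1e-9 for v in centroids]  # avoid div by zero
            u.append([1.0 / sum((d[i] / d[j]) ** (2.0 / (m - 1.0))
                                for j in range(c))
                      for i in range(c)])
        # centroid update: mean of the data weighted by u^m
        centroids = [sum((u[k][i] ** m) * xs[k] for k in range(len(xs))) /
                     sum(u[k][i] ** m for k in range(len(xs)))
                     for i in range(c)]
    return centroids, u
```

On well-separated data the centroids converge to the group means while boundary points retain partial membership in both clusters, which is the property the study exploits for feature clustering.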
This document discusses the use of fuzzy queries to retrieve information from databases. Fuzzy queries allow for imprecise or vague terms to be used in queries, similar to natural language. The document first provides background on limitations of traditional database queries. It then discusses how fuzzy set theory and membership functions can be applied to queries and data to handle uncertain terms. The proposed approach applies fuzzy queries to a relational database, defining linguistic variables and membership functions. This allows information to be retrieved based on fuzzy criteria and improves the ability to query databases using human-like terms. Benefits of fuzzy queries include more natural interaction and accounting for real-world data imperfections.
Iaetsd a survey on one class clusteringIaetsd Iaetsd
This document presents a new method for performing one-to-many data linkage called the One Class Clustering Tree (OCCT). The OCCT builds a tree structure with inner nodes representing features of the first dataset and leaves representing similar features of the second dataset. It uses splitting criteria and pruning methods to perform the data linkage more accurately than existing indexing techniques. The OCCT approach induces a decision tree using a splitting criteria and performs prepruning to determine which branches to trim. It then compares entities to match them between the two datasets and produces a final result.
Nlp based retrieval of medical information for diagnosis of human diseaseseSAT Publishing House
This document describes a proposed natural language processing (NLP) system to retrieve medical information from clinical documents for disease diagnosis. The system would use NLP techniques like named entity recognition, part-of-speech tagging, and relationship extraction to process both clinical documents and user queries. For queries asking for disease information, the system would retrieve and score relevant documents, then output disease information. For queries describing symptoms, the system would attempt to output the corresponding disease name. The system would be implemented using modules for data extraction, processing, query analysis, document retrieval and scoring, and output filtering.
Inference Networks for Molecular Database Similarity SearchingCSCJournals
Molecular similarity searching is a process to find chemical compounds that are similar to a target compound. The concept of molecular similarity play an important role in modern computer aided drug design methods, and has been successfully applied in the optimization of lead series. It is used for chemical database searching and design of combinatorial libraries. In this paper, we explore the possibility and effectiveness of using Inference Bayesian network for similarity searching. The topology of the network represents the dependence relationships between molecular descriptors and molecules as well as the quantitative knowledge of probabilities encoding the strength of these relationships, mined from our compound collection. The retrieve of an active compound to a given target structure is obtained by means of an inference process through a network of dependences. The new approach is tested by its ability to retrieve seven sets of active molecules seeded in the MDDR. Our empirical results suggest that similarity method based on Bayesian networks provide a promising and encouraging alternative to existing similarity searching methods.
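A drastically simplified view of the ranking idea, not the paper's actual Bayesian inference network, can be sketched as follows: each descriptor (fingerprint bit) shared with the target contributes evidence, weighted by how rare that descriptor is in the compound collection, mimicking the strength of a dependence link. Descriptor names and the toy collection are illustrative assumptions:

```python
import math

def evidence(target_bits, bits, collection):
    # shared descriptors weighted by rarity in the collection
    n = len(collection)
    score = 0.0
    for b in target_bits & bits:
        df = sum(1 for mol in collection if b in mol)
        score += math.log((n + 1) / (df + 1))  # rarer descriptors weigh more
    return score

def rank(target_bits, candidates, collection):
    # candidates sorted by accumulated evidence, strongest first
    return sorted(candidates,
                  key=lambda name: evidence(target_bits, candidates[name],
                                            collection),
                  reverse=True)
```

Compounds sharing more, and rarer, descriptors with the target rise to the top of the ranking, analogous to retrieving actives seeded in the MDDR.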
The World Wide Web holds a large size of different information. Sometimes while searching the World Wide Web, users always do not gain the type of information they expect. In the subject of information extraction, extracting semantic relationships between terms from documents become a challenge. This paper proposes a system helps in retrieving documents based on the query expansion and tackles the extracting of semantic relationships from biological documents. This system retrieved documents that are relevant to the input terms then it extracts the existence of a relationship. In this system, we use Boolean model and the pattern recognition which helps in determining the relevant documents and determining the place of the relationship in the biological document. The system constructs a term-relation table that accelerates the relation extracting part. The proposed method offers another usage of the system so the researchers can use it to figure out the relationship between two biological terms through the available information in the biological documents. Also for the retrieved documents, the system measures the percentage of the precision and recall.
ONTOLOGY-DRIVEN INFORMATION RETRIEVAL FOR HEALTHCARE INFORMATION SYSTEM : A C...IJNSA Journal
In health research, one of the major tasks is to retrieve and analyze heterogeneous databases containing a single patient's information gathered from a large volume of data over a long period of time. The main objective of this paper is to present our ontology-based information retrieval approach for a clinical information system. We have performed a case study in a real-life hospital setting. The results obtained illustrate the feasibility of the proposed approach, which significantly improved the information retrieval process on a large volume of data over a long period of time, from August 2011 until January 2012.
A Critical Survey On Current Literature-Based Discovery ModelsDon Dooley
This document provides a critical survey of current literature-based discovery (LBD) models. It discusses several categories of LBD methods, including statistical/probabilistic methods, vector space/algebraic methods, knowledge-based methods, inference network methods, intellectual structure analysis, and fuzzy sets theory. For each category, it summarizes key proposals and approaches that fall under that category. The overall purpose is to highlight the advantages and disadvantages of currently available LBD methodologies and provide a classification of existing LBD methods.
International Journal of Biometrics and Bioinformatics(IJBB) Volume (2) Issue...CSCJournals
This document is the front matter of the International Journal of Biometrics and Bioinformatics (IJBB) Volume 2, Issue 1 published on February 28, 2008. It includes information about the editor in chief, copyright details, a table of contents listing one paper, and brief descriptions of the paper titled "Inference Networks for Molecular Database Similarity Searching" which explores using Bayesian networks for molecular similarity searching in chemical databases.
An approach for transforming of relational databases to owl ontologyIJwest
The rapid growth of documents, web pages, and other types of text content is a huge challenge for modern content management systems. One of the problems in the areas of information storage and retrieval is the lack of semantic data. Ontologies can present knowledge in a sharable and reusable manner and provide an effective way to reduce the data-volume overhead by encoding the structure of a particular domain. Metadata in relational databases can be used to extract an ontology from a database in a specific domain. To solve the problem of sharing and reusing data, approaches based on transforming a relational database into an ontology have been proposed. In this paper we propose a method for automatic ontology construction based on a relational database. Mining further components from the relational database yields knowledge with higher semantic power and more expressiveness. Triggers are one of the database components that can be transformed into the ontology model, increasing the power and expressiveness of the knowledge by presenting part of it dynamically.
How to conduct_a_systematic_or_evidence_reviewEaglefly Fly
This document provides guidance on conducting a systematic or evidence-based literature review. It discusses defining search terms, identifying relevant articles through database searches and other methods, applying inclusion/exclusion filters to evaluate articles, synthesizing results, and summarizing the evidence found to determine the best intervention. The goal is to reduce bias and provide a comprehensive review of a topic through an explicit and transparent process.
Identifying Structures in Social Conversations in NSCLC Patients through the ...IJERA Editor
The exploration of social conversations for addressing patients' needs is an important analytical task to which many scholarly publications are contributing, filling the knowledge gap in this area. The main difficulty remains the inability to turn such contributions into pragmatic processes the pharmaceutical industry can leverage in order to generate insight from social media data, which can be considered one of the most challenging sources of information available today due to its sheer volume and noise. This study is based on the work by Scott Spangler and Jeffrey Kreulen and applies it to identify structure in social media through the extraction of a topical taxonomy able to capture the latent knowledge in social conversations on health-related sites. The mechanism for automatically identifying and generating a taxonomy from social conversations is developed and pressure-tested using public data from media sites focused on the needs of cancer patients and their families. Moreover, a novel method for generating a category's label and determining an optimal number of categories is presented, which extends Spangler and Kreulen's research in a meaningful way. We assume the reader is familiar with taxonomies, what they are and how they are used.
Statistical Analysis based Hypothesis Testing Method in Biological Knowledge ...ijcsa
This document summarizes a research paper that introduces a text mining-based method for answering biological queries and testing hypotheses. The proposed approach analyzes hypotheses stated as natural language questions and measures their statistical significance based on existing literature. It computes a p-value to determine whether to accept or reject each hypothesis. The method also generates a network of related biological entities to provide context and suggest new hypotheses for further investigation. The goal is to help researchers quantitatively evaluate assumptions and guide relevant discovery of new biological knowledge.
Automatic Generation of Multiple Choice Questions using Surface-based Semanti...CSCJournals
Multiple Choice Questions (MCQs) are a popular large-scale assessment tool. MCQs make it much easier for test-takers to take tests and for examiners to interpret their results; however, they are very expensive to compile manually, and they often need to be produced on a large scale and within short iterative cycles. We examine the problem of automated MCQ generation with the help of unsupervised Relation Extraction, a technique used in a number of related Natural Language Processing problems. Unsupervised Relation Extraction aims to identify the most important named entities and terminology in a document and then recognize semantic relations between them, without any prior knowledge as to the semantic types of the relations or their specific linguistic realization. We investigated a number of relation extraction patterns and tested a number of assumptions about linguistic expression of semantic relations between named entities. Our findings indicate that an optimized configuration of our MCQ generation system is capable of achieving high precision rates, which are much more important than recall in the automatic generation of MCQs. Its enhancement with linguistic knowledge further helps to produce significantly better patterns. We furthermore carried out a user-centric evaluation of the system, where subject domain experts from biomedical domain evaluated automatically generated MCQ items in terms of readability, usefulness of semantic relations, relevance, acceptability of questions and distractors and overall MCQ usability. The results of this evaluation make it possible for us to draw conclusions about the utility of the approach in practical e-Learning applications.
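The unsupervised pattern idea can be sketched minimally: given sentences whose entities are already tagged, the word span between two adjacent entities is taken as a candidate relation phrase, kept only when the connector is short. The bracket tagging convention and sentences are illustrative assumptions, not the system's format:

```python
import re

ENTITY = re.compile(r"\[(.+?)\]")  # entities pre-marked in brackets

def extract_relations(sentence):
    # the words between two adjacent entities become the candidate
    # relation phrase; long connectors are discarded as unreliable
    spans = [(m.group(1), m.start(), m.end()) for m in ENTITY.finditer(sentence)]
    triples = []
    for (e1, _, end1), (e2, start2, _) in zip(spans, spans[1:]):
        phrase = sentence[end1:start2].strip()
        if 0 < len(phrase.split()) <= 4:
            triples.append((e1, phrase, e2))
    return triples
```

Frequent relation phrases across a corpus then become candidate question stems, with the second entity as the key and other entities of the same type as distractors.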
This document summarizes a research paper that proposes using machine learning algorithms and natural language processing techniques to extract disease and treatment information from medical texts. It discusses using a pipeline of tasks including identifying relevant sentences, representing the sentences, and classifying the relationships between diseases and treatments. The paper reviews previous work on using algorithms like naive Bayes and conditional naive Bayes for information extraction and relation classification. It proposes applying machine learning and natural language processing to biomedical texts from sources like Medline to automatically extract symptoms, causes, and treatments for diseases specified in user queries. This extracted information could help doctors and patients by providing structured medical knowledge in a time-saving manner.
Research on ontology based information retrieval techniquesKausar Mukadam
The document summarizes and compares three novel ontology-based information retrieval techniques. It discusses a technique for retrieving information in the domain of Traditional Chinese Medicine that uses an ontology to represent concepts and measures concept similarity to sort search results. It also describes a framework for semantic indexing and querying that uses an ontology and entity-attribute-value model to improve scalability, usability, and retrieval performance for transport systems. Additionally, it outlines a semantic extension retrieval model that uses ontology annotation and semantic extension of queries to address limitations of keyword-based search. The techniques are evaluated based on precision and recall measures to analyze their effectiveness compared to traditional methods.
Information Retrieval on Text using Concept Similarityrahulmonikasharma
This document summarizes a research paper on concept-based information retrieval using semantic analysis and WordNet. It discusses some of the challenges with keyword-based retrieval, such as synonymy and polysemy problems. Concept-based retrieval aims to address these issues by mapping documents and queries to semantic concepts rather than keywords. The paper proposes extracting concepts from text documents using WordNet to identify synonyms, hypernyms and hyponyms. It involves calculating term frequencies to determine a hierarchy of important concepts. The methodology is implemented using Java and WordNet to extract concepts from sample input documents.
An Improved Mining Of Biomedical Data From Web Documents Using ClusteringKelly Lipiec
This document summarizes a research paper that proposes an improved method for mining biomedical data from web documents using clustering. Specifically, it develops an optimized k-means clustering algorithm to group similar biomedical documents together based on identifying relevant terms using the Unified Medical Language System (UMLS). The approach aims to more efficiently retrieve relevant biomedical documents for users. It compares the proposed method to the original k-means algorithm and finds it achieves an average F-measure of 99.06%, indicating more accurate clustering of biomedical web documents.
Similar to TWO LEVEL SELF-SUPERVISED RELATION EXTRACTION FROM MEDLINE USING UMLS (20)
Elevate Your Nonprofit's Online Presence_ A Guide to Effective SEO Strategies...TechSoup
Whether you're new to SEO or looking to refine your existing strategies, this webinar will provide you with actionable insights and practical tips to elevate your nonprofit's online presence.
Beyond Degrees - Empowering the Workforce in the Context of Skills-First.pptxEduSkills OECD
Iván Bornacelly, Policy Analyst at the OECD Centre for Skills, OECD, presents at the webinar 'Tackling job market gaps with a skills-first approach' on 12 June 2024
This presentation was provided by Racquel Jemison, Ph.D., Christina MacLaughlin, Ph.D., and Paulomi Majumder. Ph.D., all of the American Chemical Society, for the second session of NISO's 2024 Training Series "DEIA in the Scholarly Landscape." Session Two: 'Expanding Pathways to Publishing Careers,' was held June 13, 2024.
Andreas Schleicher presents PISA 2022 Volume III - Creative Thinking - 18 Jun...EduSkills OECD
Andreas Schleicher, Director of Education and Skills at the OECD presents at the launch of PISA 2022 Volume III - Creative Minds, Creative Schools on 18 June 2024.
Level 3 NCEA - NZ: A Nation In the Making 1872 - 1900 SML.pptHenry Hollis
The History of NZ 1870-1900.
Making of a Nation.
From the NZ Wars to Liberals,
Richard Seddon, George Grey,
Social Laboratory, New Zealand,
Confiscations, Kotahitanga, Kingitanga, Parliament, Suffrage, Repudiation, Economic Change, Agriculture, Gold Mining, Timber, Flax, Sheep, Dairying,
Leveraging Generative AI to Drive Nonprofit InnovationTechSoup
In this webinar, participants learned how to utilize Generative AI to streamline operations and elevate member engagement. Amazon Web Service experts provided a customer specific use cases and dived into low/no-code tools that are quick and easy to deploy through Amazon Web Service (AWS.)
International Journal of Data Mining & Knowledge Management Process (IJDKP) Vol.6, No.3, May 2016
DOI : 10.5121/ijdkp.2016.6302
TWO LEVEL SELF-SUPERVISED RELATION
EXTRACTION FROM MEDLINE USING UMLS
Huda Banuqitah, Fathy Eassa, Kamal Jambi and Maysoon Abulkhair
Faculty of Computing & Information Technology,
King Abdulaziz University, Jeddah- Saudi Arabia
ABSTRACT
The biomedical research literature is one of many domains that hide precious knowledge, and the biomedical community makes extensive use of this scientific literature to discover facts about biomedical entities such as diseases and drugs. MEDLINE is a huge database of biomedical research papers which remains a significantly underutilized source of biological information. Discovering useful knowledge from such a huge corpus raises various problems related to the type of information, such as the concepts related to the domain of the texts and the semantic relationships associated with them. In this paper, we propose a Two-level model for Self-supervised relation extraction from MEDLINE using the Unified Medical Language System (UMLS) knowledge base. The model uses a Self-supervised approach for Relation Extraction (RE) by constructing enhanced training examples using information from UMLS. The model shows better results in comparison with the current state of the art and naïve approaches.
KEYWORDS
Relation Extraction, Self-supervised, Machine Learning, Knowledge base.
1. INTRODUCTION
In the last two decades, the usage of medical computing systems has shown explosive growth. The vast amount of information they store potentially contains new knowledge that can provide decision support to enhance the quality of medical care. Enormous databases and repositories of biomedical literature are available to the research community and may contain the required knowledge. MEDLINE is one example of an online bibliographic database from the biomedical domain; it contains more than 22 million biomedical journal articles [1]. Knowledge Discovery from Databases (KDD) over such a biomedical corpus as MEDLINE is a complicated process involving several steps [2]. The efficient exploitation of these resources requires Information Extraction (IE) techniques that transform unstructured information into structured form. An example of such techniques is Relation Extraction (RE), the automatic mining of relations between biomedical entities in text. Extracting the relationship between biomedical entities is the process of determining the semantic link between those entities and characterizing the nature of this relationship [1]. Recently, RE has found growing interest in the IE community and many studies have focused on it, because it helps to find new relations and interactions between biomedical entities in raw text while minimizing human intervention. RE includes multiple techniques such as rule-based approaches, Natural Language Processing (NLP) and Machine Learning (ML) methods [3, 4]. There are three
main types of RE approaches: Unsupervised methods, which need no labelling; Supervised methods, which use a corpus of labelled data; and Self-supervised methods, which use a small set of labelled examples. Unsupervised methods extract the strings of words that occur between entities in large amounts of text, and then cluster and simplify these word strings to produce relations. Unsupervised methods can use enormous quantities of data and extract very large numbers of relationships, but the resulting relations may not be easy to map to the relations needed for a particular knowledge base.
On the other hand, Supervised relation extraction uses ML techniques that require sufficiently annotated training data consisting of positive and negative examples. Constructing such an annotated data set for training is expensive, requires expert knowledge and consumes plenty of time. The Self-supervised approach overcomes this bottleneck by using a large knowledge base containing information about the target relation to automatically annotate a data set. The main assumption is that sentences containing an entity pair that does (or does not) participate in a relation will also express (or not express) that relation. Furthermore, Self-supervised approaches combine the advantage of supervised approaches, by including noisy pattern features in a probabilistic classifier, with the advantage of Unsupervised methods, by extracting large numbers of relations from large corpora. It is generally believed that Self-supervised techniques benefit relation extraction in generic domains. However, these techniques are not fully explored in the biomedical domain, for two reasons. The first is that the main source of knowledge for Self-supervised approaches is Freebase, which covers the general domain and lacks biomedical knowledge. The second is that existing Self-supervised learning models assume that entity instances are independent, an assumption that is violated and not applicable in the biomedical domain [5].
Thus, we propose a Two-level biomedical relational model for automatic Self-supervised Relation Extraction from the biomedical domain using UMLS. As mentioned previously, KDD is an iterative and interactive multiphase process that includes steps such as data selection, data preparation and preprocessing, data transformation, Data Mining (DM) and evaluation. Accordingly, we developed our model to discover knowledge from the biomedical domain by integrating diverse data mining techniques, including Self-supervision, natural language processing and machine learning, to build a Relation Extraction system from MEDLINE that requires minimal supervision using the Unified Medical Language System (UMLS).
UMLS is a collection of files and software that includes different biomedical knowledge bases and vocabularies. The Metathesaurus is a database in UMLS that contains millions of health- and biomedicine-related concept names and the relationships between them. All concepts are categorized by their semantic type, and all names of a concept are unified by a Concept Unique Identifier (CUI). MRREL is a subset of the Metathesaurus and contains different relationships between biomedical concepts, each defined by a pair of CUIs.
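The MRREL layout described above can be sketched as a simple lookup keyed by CUI pair. This is a minimal illustration only; the CUI values and one of the relation names below are hypothetical placeholders, not real UMLS identifiers:

```python
# Minimal sketch of MRREL as a lookup keyed by a CUI pair.
# The CUIs below are hypothetical placeholders, not real UMLS identifiers.
MRREL = {
    ("C0000001", "C0000002"): "may_treat",
    ("C0000003", "C0000004"): "gene_product_malfunction_associated_with_disease",
}

def relation_for(cui1, cui2):
    """Return the MRREL relation name for a CUI pair, or None if absent."""
    return MRREL.get((cui1, cui2))

print(relation_for("C0000001", "C0000002"))  # may_treat
```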
The aim of this paper is to enhance Self-supervised Relation Extraction in the MEDLINE biomedical domain by using the semantic types of entities from the UMLS knowledge base to construct training examples. Our contribution to the overall solution is based on a mature architecture with a proof-of-concept implementation: we use the semantic types of concepts to construct examples of the relation of interest as training examples for the Self-supervised approach, which improves the results compared with others in terms of Precision, Recall and F-Score.
The rest of the paper is organized as follows. Section 2 presents the related work, while Section 3 describes the details of our knowledge discovery model. Section 4 presents the experimental procedure, with the methods of training and test set construction. Section 5 presents the results and discussion, and the final section concludes and outlines future work.
2. RELATED WORK
Current biomedical research needs to exploit the enormous amount of information reported in the scientific literature using DM techniques, in particular techniques aimed at finding relationships between entities, which is the key to identifying actionable knowledge in this literature; this task is called Relation Extraction (RE). This section presents the different efforts in RE in the biomedical domain that use the Self-supervised approach.
Figure 1: Self-Supervised workflow[6].
The author in [6] presented a general distant supervision approach for relationship extraction as
shown in Figure 1. The details are summarized in the following steps:
1. Identify a knowledge base which includes pairs of entities for the relationship type in question (e.g., a PPI database).
2. Compile a large, unannotated text resource relevant to the target domain (e.g., MEDLINE abstracts).
3. Recognize and normalize the related named entities (e.g., protein names).
4. Associate entity pairs from the knowledge base with previously identified instances in the text corpus.
5. Entity pairs contained in the knowledge base are labelled as positive examples. Negative examples are labelled following the closed-world assumption, which states that entity pairs absent from the knowledge base do not exhibit the relationship type in question.
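The labelling in steps 4 and 5 can be sketched as follows. This is a minimal illustration assuming entities are already recognized and normalized; the pair names are invented:

```python
# Sketch of closed-world labelling: pairs found in the knowledge base are
# positive; co-occurring pairs absent from it are labelled negative.
known_pairs = {("proteinA", "proteinB")}  # entity pairs from the knowledge base

def label_pairs(cooccurring_pairs):
    """Assign a distant-supervision label to each co-occurring entity pair."""
    return [(p, "positive" if p in known_pairs else "negative")
            for p in cooccurring_pairs]

result = label_pairs([("proteinA", "proteinB"), ("proteinA", "proteinC")])
print(result)
```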
There are few works that use Self-supervised approaches in the biomedical domain. The authors in [7] proposed Self-supervised relation extraction using the Yeast Protein Database (YPD), which contains subcellular location fields for many proteins; they collected a set of instances of subcellular locations of proteins from YPD and then identified sentences from the associated PubMed abstracts in order to obtain an annotated corpus.
Using the coordination structure of entities in sentences, [5] developed a distantly supervised model that combines the results of open information extraction techniques to perform relation extraction from the biomedical literature. The model incorporates a grouping strategy to take into consideration the coordination structure among entities co-occurring in one sentence. They applied the approach to extract gene expression relationships between genes and brain regions from the literature. The results showed that the method achieves better performance than a Transductive Support Vector Machine baseline and a non-grouping strategy.
In [8] the authors use Self-supervised learning to train a classifier for Protein-Protein Interactions (PPI). They use a Support Vector Machine (SVM) classification algorithm with a shallow linguistic kernel. The knowledge about interacting proteins is taken from the IntAct database.
Using UMLS as the knowledge base, the authors in [9] proposed Self-supervised relation extraction from MEDLINE abstracts, using UMLS to automatically annotate the training data which is then used to train the classifier. To generate the positive and negative training examples, all CUI pairs for the target relation are extracted from MRREL and considered a set of positive instance pairs; thus, the occurrence of a positive entity pair in a sentence is taken to represent the relation of interest. Any CUI pair that also occurs in another MRREL relation is removed from the list of positive instance pairs. Negative examples are generated from the positive instance pair set: new CUI pair combinations are generated by combining all CUIs from the first position with all CUIs from the second position, and a newly generated CUI pair is used as a negative instance pair only if it is neither in the positive list nor contained in another MRREL relation. The model was evaluated using two techniques, held-out and manual evaluation. In the manual evaluation, the relation classifier was trained on the "may_treat" relation created by the Self-supervised process and evaluated on a manually annotated test corpus; the result outperforms a naïve approach with an F-Score of 0.571, Precision of 0.600 and Recall of 0.545. The results indicate that UMLS is a useful resource for Self-supervised relation extraction. Also using UMLS to train a distantly supervised relational classifier, [10] presented the first results using the UMLS knowledge base; the model was evaluated using existing evaluation data sets, since no resources directly annotated with UMLS relations were available. Their results showed that a distantly supervised classifier trained on MRREL relations similar to those found in the evaluation data set provides promising results.
The authors in [11] demonstrated the potential of distant learning for constructing a fully automated relation extraction process. They produced two distantly labelled corpora, for protein-protein and drug-drug interaction extraction, using knowledge found in databases such as IntAct for genes and DrugBank for drugs. They labelled approximately 50,000 MEDLINE abstracts using a shallow linguistic classifier trained on a distantly labelled corpus; the classifier trained on five manually annotated corpora and the same classifier trained on a distantly labelled corpus agree on 86.4% of all 50,000 predictions.
Some work on the Self-supervised approach has been done outside the biomedical domain. Mintz et al. [12] use Freebase to provide distant supervision for relation extraction. They apply a similar heuristic, matching Freebase tuples with unstructured sentences from Wikipedia articles to create features for learning relation extractors. Matching Freebase against arbitrary sentences, instead of matching Wikipedia infoboxes against the corresponding Wikipedia articles, potentially increases the number of matched sentences at a cost in accuracy. They conclude that their results suggest syntactic features are indeed useful in distantly supervised information extraction. The authors of [13] used the Freebase knowledge base to annotate the New York Times corpus with entity pairs, focusing on three relations: nationality, place of birth, and contains. To train the classifier, they introduced the use of a multi-instance learning approach in this context. In contrast, the authors in [14] annotated the information in Wikipedia articles using Wikipedia infoboxes as the knowledge source.
3. MODEL DESCRIPTION
3.1. Model Architecture
The first level deals with data extraction, preparation and relation example extraction using the UMLS knowledge base. The second level deals with feature extraction, constructing the training set, and then training and evaluating the classifier using SVM classification to extract the relations of unlabelled data; SVM was found to perform best for multi-class classification problems [15].

As shown in Figure 2, in our model architecture the user enters a query through the system interface; the model then coordinates the execution of the user query and displays the result. The sentences that contain a pair of entities with semantic types matching the user query are retrieved. We use a MySQL database in our model to store and query annotated sentences, and to retrieve relation examples that match the user query from the UMLS knowledge base. Feature extraction from the sentences is done by tokenizing, lemmatizing and parsing them; we built our model on top of the CoreNLP library [16]. Finally, we construct the training examples that the classifier uses to train on the data set with the Linear SVM classification algorithm, and then extract relations for the unlabelled sentences.
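As a rough illustration of the second level, the following sketch trains a Linear SVM on "words between entities" strings using scikit-learn. This is an assumed implementation choice, not the authors' code, and the toy sentences and labels are invented:

```python
# Toy second-level sketch: bag-of-words features over the text between the
# two entities, classified with a Linear SVM. Training strings are invented.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

train_texts = ["used in the treatment of", "is associated with",
               "may prevent recurrence of", "was measured alongside"]
train_labels = ["may_treat", "other", "may_treat", "other"]

vec = CountVectorizer()
X = vec.fit_transform(train_texts)       # feature vectors for training
clf = LinearSVC().fit(X, train_labels)   # train the Linear SVM classifier

pred = clf.predict(vec.transform(["used in the treatment of"]))[0]
print(pred)  # may_treat
```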
Figure 2: Model Architecture
3.2 Model Workflow
The whole process of our approach can be detailed as follows:
3.2.1 Pre-processing
We used the MetaMapped MEDLINE corpus as initial data, which consists of the articles and abstracts of millions of research papers and publications from medicine. The sentences of MEDLINE containing the information of interest are used to generate the training examples for the Self-supervised model, so it is important to identify the related information. The relations in UMLS are identified by a pair of Concept Unique Identifiers (CUIs), so we need a mapping of UMLS concepts to the MEDLINE sentences. For that, we used MetaMapped MEDLINE, which is annotated by the MetaMap tool. Each sentence in MEDLINE is annotated with UMLS concepts, and the annotations are represented in MetaMap machine output format.
The whole MetaMapped MEDLINE corpus is about 165 GB in size and can be downloaded in two ways: as one file containing the entire corpus, or as 779 smaller files of about 250 MB each. In our experiment, we used a subset of those files.

The basic unit of MetaMap machine output is an utterance, which represents the annotation for a single sentence. Each utterance consists of phrases, subsets of the initial sentence. For each phrase, MetaMap provides several different mappings. Each mapping consists of several entity values that represent the matched concept and a score assigned by the MetaMap tool. The value of the assigned score reflects the mapping confidence: the lower the score value, the higher the confidence in the mapping.
In more detail, we populated the database with only those mappings that match our test user query and have the best score. We use two semantic types, "bacs" and "dsyn", where "bacs" is Biologically Active Substance and "dsyn" refers to Disease or Syndrome. The database was populated with more than 503,151 sentences from an initial 30 GB of MetaMapped MEDLINE files.
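The filtering just described can be sketched as follows. The record layout is a simplified stand-in for MetaMap's machine output, and the CUIs and scores are invented:

```python
# Sketch: keep only mappings whose semantic type matches the queried pair
# ("bacs"/"dsyn") and, among those, the one with the best (lowest) score.
QUERY_TYPES = {"bacs", "dsyn"}

mappings = [
    {"cui": "C0000010", "semtype": "bacs", "score": -861},
    {"cui": "C0000010", "semtype": "bacs", "score": -1000},
    {"cui": "C0000011", "semtype": "topp", "score": -1000},  # type not queried
]

matching = [m for m in mappings if m["semtype"] in QUERY_TYPES]
best = min(matching, key=lambda m: m["score"])  # lower score = higher confidence
print(best["score"])  # -1000
```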
3.2.2 Feature extraction
We adopted the same features implemented by [10, 12, 17] because they clearly represent the relation between the entities in a sentence and help in determining the correct class of the relation between disease and treatment. The adopted features are: the sequence of words between the entities, the Part Of Speech (POS) tags of the words between the entities, and the words on the semantic path between the entities. To construct these lexical and syntactic features, we annotated each sentence in the training set with part-of-speech tags and a dependency tree using the Stanford CoreNLP library. Consider the following example sentence: "Multiple doses of METHOTREXATE used in the treatment of ECTOPIC PREGNANCY", where METHOTREXATE is the first entity in the sentence and ECTOPIC PREGNANCY is the second.
Figure 3: Parsing tree of the sentence. NSUBJ is nominal subject, PREP is prepositional modifier, PUNCT is punctuation, POBJ is object of a preposition, NNS is plural noun, ADP is adposition (prepositions and postpositions) and ORG is organization.
Figure 3 shows the parsing tree of the sentence, where:
1. The words-between-entities feature for the example sentence is: "used in the treatment of".
2. The words on the semantic path between entities lie on a path in the semantic graph (arrows in Figure 3). For the example sentence, we start from the first entity, then follow the arrows to the word "of", then to "multiple doses", then to "used", then to "in", then to "the treatment", "of", and finally to the second entity.
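For the example sentence, the first two features can be computed directly from the token sequence. This is a minimal sketch; the POS tags are hand-written stand-ins for CoreNLP output:

```python
# Sketch of the words-between-entities and POS-between-entities features
# for the example sentence; POS tags are hand-written stand-ins.
tokens = ["Multiple", "doses", "of", "METHOTREXATE", "used", "in",
          "the", "treatment", "of", "ECTOPIC", "PREGNANCY"]
pos = ["JJ", "NNS", "IN", "NN", "VBN", "IN", "DT", "NN", "IN", "JJ", "NN"]

e1 = tokens.index("METHOTREXATE")   # first entity position
e2 = tokens.index("ECTOPIC")        # start of the second entity
words_between = tokens[e1 + 1:e2]
pos_between = pos[e1 + 1:e2]

print(" ".join(words_between))  # used in the treatment of
```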
3.2.3 Data set annotations and relation extraction
For data set annotation with relations, we used the MRREL relation subset of the UMLS knowledge base. Each entry in MRREL contains a pair of entities and the relation between them, in the form (CUI_1, CUI_2, relation name). The relation annotation algorithm works in two steps. First, it looks for all entries in MRREL that match the semantic types from the user query. Second, it
matches those entries to sentences by CUI. If a sentence contains the same pair of CUIs as an MRREL entry, it is annotated with the corresponding relation. For relation extraction, we trained and evaluated a classifier using the SVM classification algorithm on our data set to extract and predict the relations between disease and treatment for the remaining unlabelled sentences. The classification process is based on the previously described extracted features, which constitute the feature vectors used to train the classifier.
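The two-step annotation above can be sketched as follows (all CUIs, semantic types and sentences are hypothetical placeholders):

```python
# Sketch of the two-step annotation: (1) filter MRREL entries by the queried
# semantic types, (2) match sentences to entries by their CUI pair.
mrrel = [
    ("C01", "C02", "may_treat", ("bacs", "dsyn")),
    ("C03", "C04", "related_to", ("gngm", "dsyn")),
]
query_types = ("bacs", "dsyn")

# Step 1: entries whose semantic types match the user query.
candidates = {(c1, c2): rel
              for c1, c2, rel, types in mrrel if types == query_types}

# Step 2: annotate sentences whose CUI pair appears among the candidates.
sentences = [{"cuis": ("C01", "C02")}, {"cuis": ("C01", "C05")}]
for s in sentences:
    s["relation"] = candidates.get(s["cuis"])

print([s["relation"] for s in sentences])  # ['may_treat', None]
```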
4. EXPERIMENT
4.1. Training set construction
The training set was constructed from sentences that matched MRREL relations using our own method, as described in section 3.2.3 and inspired by [9]. However, our model differs in two aspects: we used the semantic types of the entities to get all relations between the biomedical entities in UMLS, and we used general relation examples that appear between our given semantic types to construct the negative examples. In contrast, the authors of [9] used only pairs that participate in the "may_treat" relation, regardless of their semantic type.
To enhance the training set quality, we applied filtering by part-of-speech tag. The most common error of the MetaMap tool, as observed by manual checking, is annotating verbs or adjectives as if they were nouns. Using the CoreNLP library as in [16], we annotated each sentence in the training set with part-of-speech tags and discarded those sentences whose concepts were not marked as nouns.
For training set labelling, all relations were divided into two groups: specific relations, labelled with "RO" in MRREL (where an RO relation is described as a relationship other than synonymous, narrower, or broader), and non-RO relation groups that represent more general relations. General relations were considered negative examples for classification and labelled "other". Sentences with multiple "RO" relations were not included in the training set, because they could represent any of those relations while the classifier needs an exact match between label and ground truth. We also discarded non-frequent relations.
Another observation was that the "RO=may_treat" relation almost always includes the "RO=may_prevent" relation, and almost all sentences labelled with "may_prevent" were also labelled with "may_treat". Manual analysis showed that the ground truth for such a sentence could be either of the two relations, as shown in Example 1, where the treatment "desferrioxamine" treats the "iron overload", and the two relations are indistinguishable by MRREL. We decided to unite such relations into one more general relation.

Example 1: [Intensified desferrioxamine (TREATMENT) treatment (by either subcutaneous or intravenous route) or use of other oral iron chelators, or both, remains the established treatment to reverse cardiac dysfunction due to iron overload (DISEASE)]
So, of the 503,151 sentences, only 291,575 remained with a single pair of semantic types "dsyn-bacs", and 6,699 of these were labelled with a relation from MRREL, leaving 284,876 unlabelled sentences. The final training set, after removing the sentences with more than one relation, consisted of 4,171 (positive and negative) examples with specific relationships, as follows:
• RO=null, 1493 examples
• RO=gene_product_malfunction_associated_with_disease, 1246 examples
• RO=related_to, 539 examples
• RO=may_treat, 311 examples
• Other, 582 examples.
Since our target relation is "may_treat", we observed that the "null" and "related_to" relations do not serve this relation between treatment and disease entities. Considering Example 2, we can observe that "METABOLIC SYNDROME" does not treat or prevent "CHOLESTEROL"; rather, they are related to each other in another way. We therefore excluded the "null" and "related_to" examples from the training data set.

Example 2: [BACKGROUND: To establish the rate of agreement in predicting METABOLIC SYNDROME (TREATMENT) (ms) in different pediatric classifications using percentiles or fixed cut-offs, as well as exploring the influence of CHOLESTEROL (DISEASE)]

After excluding the "related_to" and "null" relations from the positive training examples, we were left with 2,139 labelled examples.
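The labelling rules described in this section (grouping general relations as "other", merging "may_prevent" into "may_treat", and discarding ambiguous or excluded relations) can be sketched as follows. The relation names follow the text; the function itself is an illustrative reconstruction, not the authors' code:

```python
# Sketch of the training-set labelling rules; an illustrative reconstruction.
def group_label(ro_relations):
    """Map a sentence's RO relations from MRREL to a single training label."""
    if not ro_relations:                  # only general relations -> negative class
        return "other"
    merged = {"may_prevent": "may_treat"}  # unite indistinguishable relations
    labels = {merged.get(r, r) for r in ro_relations}
    if len(labels) > 1:                   # multiple RO relations -> discard
        return None
    label = labels.pop()
    if label in {"null", "related_to"}:   # excluded from positive examples
        return None
    return label

print(group_label(["may_treat", "may_prevent"]))  # may_treat
```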
4.2. Testing set construction
We used two test data sets to evaluate our classifier. The first test set was constructed by combining different relation mining data sets so that it would be similar to the training set. For the second test set, we used the same test set presented in [9], with the authors' permission.

In the first test set, we employed the three most specific and frequent relations, "may_treat", "gene_product_malfunction_associated_with_disease" and "other", to match our training set, which contains these relations. We refer to this data set as the "Triple relation" test set for simplicity. For this test set, 70 examples of the "other" relation were labelled manually; 500 "may_treat" examples and 60 "other" examples were obtained from the disease-treatment relations test set in [17]; and 500 examples of "gene_product_malfunction_associated_with_disease" were randomly chosen among the positive examples of the gene-disease relation test set in [18].
The second test set, from [9], contains 227 examples of "other" relations and 173 examples of "may_treat" relations; we call this set the "may_treat" test set. Since it is important to keep in the training set only those relations present in the test set, we excluded the relation "gene_product_malfunction_associated_with_disease" from the training examples when evaluating on the "may_treat" test set from [9].
5. RESULT AND DISCUSSION
The results of the model proposed by Roller et al. in [9] are shown in Table 1.
Table 1. Results of [9] on evaluation based on the "may_treat" test set.
For evaluation, we used the most common metrics for classifier evaluation: Precision, Recall and F-Score, defined in equations (1), (2) and (3) respectively:

Precision = TP / (TP + FP)                                    (1)

Recall = TP / (TP + FN)                                       (2)

F-Score = 2 * (Precision * Recall) / (Precision + Recall)     (3)

where TP is the number of true positive classification results, FP the number of false positives and FN the number of false negatives.
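Equations (1)-(3) can be checked with a few lines of code; the counts below are arbitrary illustrative values, not results from the paper:

```python
# Precision, Recall and F-Score as defined in equations (1)-(3).
def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def f_score(p, r):
    return 2 * p * r / (p + r)

p, r = precision(9, 3), recall(9, 9)   # illustrative counts: TP=9, FP=3, FN=9
print(p, r, f_score(p, r))  # 0.75 0.5 0.6
```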
Applying different combinations of the features discussed in section 3.2.2, the model shows better results than the work in [9] in terms of Precision, Recall and F-Score when using Linear SVM as the classification algorithm with words between entities as the basic feature, based on the "Triple relation" test set, as shown in Table 2. Furthermore, as shown in Table 3 and based on the "may_treat" test set, the best result in terms of Recall and F-Score is achieved when using the words-between-entities feature together with the words-on-semantic-path feature under the Linear SVM algorithm, which outperforms the best result in [9]; this indicates the efficiency of the proposed approach of constructing training examples using the semantic types of biomedical entities in UMLS.
Table 2. Model results of evaluation based on "Triple relation" test set.

Method                                      Accuracy   Avg. Precision   Avg. Recall   Avg. F-Score
Naive approach                              0.38       0.37             0.3           0.31
Words between entities, using Linear SVM    0.62       0.76             0.62          0.64
Table 3. Model results of evaluation based on "may_treat" test set.

Method                                                        Accuracy   Precision   Recall   F-Score
Naive approach                                                0.57       0.32        0.57     0.41
Words between entities + words on semantic path, Linear SVM   0.60       0.54        0.72     0.62
Figure 4 shows the comparison between the results of the model on the "may_treat" test set, using entity semantic types, and the results in [9], which did not use semantic types. As can be seen, the proposed model achieves the best results in terms of Recall and F-Score.
Table 1 (results of [9] on the "may_treat" test set):

Training set size           Precision   Recall   F-Score
5,000 training instances    0.273       0.273    0.273
10,000 training instances   0.600       0.545    0.571
20,000 training instances   0.417       0.455    0.435
Figure 4: Comparison of the Precision, Recall and F-Score of the proposed model and [9].
6. CONCLUSION AND FUTURE WORK
In this paper, we proposed a Two-Level Knowledge Discovery model for Relation Extraction using the UMLS knowledge base and demonstrated its performance on MEDLINE data. We used a self-supervised approach for relation extraction that incorporates data mining and machine learning techniques. Additionally, we proposed our own approach to constructing training examples with positive and negative instances based on entity semantic types from the MRREL table in UMLS. The approach achieved better results in terms of Precision, Recall, and F-Score, with 0.76, 0.62, and 0.64 respectively on the "Triple relation" test set, and Recall of 0.72 and F-Score of 0.62 on the "may_treat" test set. The model also demonstrates an approach to minimizing the cost of relation extraction by using weakly labelled training examples built from UMLS. Our future plan is to handle multiple data and knowledge sources by developing an algorithm for prioritizing relation examples from different corpora and knowledge bases.
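The self-supervised labelling idea summarized above can be sketched as follows. This is a hypothetical illustration: the MRREL pairs, sentences, and helper function are invented for the example and are not real UMLS data or the authors' implementation.

```python
# Sketch of self-supervised labelling: a sentence mentioning an entity
# pair listed in UMLS MRREL for the target relation (e.g. may_treat)
# becomes a positive example; a co-occurring pair absent from MRREL
# becomes a negative example. MAY_TREAT here is a toy stand-in.
MAY_TREAT = {("aspirin", "headache"), ("metformin", "diabetes")}

def label_sentence(sentence: str, drug: str, disease: str):
    """Return a (features, label) pair for a co-occurring entity pair."""
    label = 1 if (drug, disease) in MAY_TREAT else 0
    # Feature: the words between the two entity mentions.
    start = sentence.index(drug) + len(drug)
    end = sentence.index(disease)
    between = sentence[start:end].strip()
    return between, label

print(label_sentence("aspirin relieves a mild headache", "aspirin", "headache"))
# ('relieves a mild', 1)
```

The resulting weakly labelled pairs would then feed the linear SVM classifier evaluated in Tables 2 and 3.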
REFERENCES
[1] A. Bchir and W. B. A. Karaa, "Extraction of drug-disease relations from MEDLINE abstracts," in
Computer and Information Technology (WCCIT), 2013 World Congress on, 2013, pp. 1-3.
[2] S. Benomrane, M. Ben Ayed, and A. M. Alimi, "An agent-based Knowledge Discovery from
Databases applied in healthcare domain," in Advanced Logistics and Transport (ICALT), 2013
International Conference on, 2013, pp. 176-180.
[3] V. N. Romero, S. Kudama, and R. B. Llavori, "Towards the Discovery of Semantic Relations in
Large Biomedical Annotated Corpora," in Database and Expert Systems Applications (DEXA), 2011
22nd International Workshop on, 2011, pp. 465-469.
[4] L. Yao, C. J. Sun, X. L. Wang, and X. Wang, "Relationship extraction from biomedical literature
using Maximum Entropy based on rich features," in Machine Learning and Cybernetics (ICMLC),
2010 International Conference on, 2010, pp. 3358-3361.
[5] L. Mengwen, L. Yuan, A. Yuan, H. Xiaohua, A. Yagoda, and R. Misra, "Relation extraction from
biomedical literature with minimal supervision and grouping strategy," in Bioinformatics and
Biomedicine (BIBM), 2014 IEEE International Conference on, 2014, pp. 444-449.
[6] P. Thomas, "Robust relationship extraction in the biomedical domain," Mathematisch-
Naturwissenschaftliche Fakultät, 2015.
[7] M. Craven and J. Kumlien, "Constructing Biological Knowledge Bases by Extracting Information
from Text Sources," presented at the Proceedings of the Seventh International Conference on
Intelligent Systems for Molecular Biology, 1999.
[8] P. Thomas, I. Solt, R. Klinger, and U. Leser, "Learning protein protein interaction extraction using
distant supervision," Robust Unsupervised and Semi-Supervised Methods in Natural Language
Processing, pp. 34-41, 2011.
[9] R. Roller and M. Stevenson, "Self-supervised Relation Extraction Using UMLS," in Information
Access Evaluation. Multilinguality, Multimodality, and Interaction. vol. 8685, E. Kanoulas, M. Lupu,
P. Clough, M. Sanderson, M. Hall, A. Hanbury, et al., Eds., ed: Springer International Publishing,
2014, pp. 116-127.
[10] R. Roller and M. Stevenson, "Applying UMLS for Distantly Supervised Relation Detection," in
Proceedings of the 5th International Workshop on Health Text Mining and Information Analysis
(Louhi), 2014, pp. 80-84.
[11] P. Thomas, T. Bobic, M. Hofmann-Apitius, U. Leser, and R. Klinger, "Weakly Labelled Corpora as
Silver Standard for Drug-Drug and Protein-Protein Interaction," Third Workshop on Building and
Evaluating Resources for Biomedical Text Mining Workshop Programme, p. 63, 2012.
[12] M. Mintz, S. Bills, R. Snow, and D. Jurafsky, "Distant supervision for relation extraction without
labelled data," presented at the Proceedings of the Joint Conference of the 47th Annual Meeting of
the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP:
Volume 2 - Volume 2, Suntec, Singapore, 2009.
[13] S. Riedel, L. Yao, and A. McCallum, "Modeling relations and their mentions without labelled text,"
presented at the Proceedings of the 2010 European conference on Machine learning and knowledge
discovery in databases: Part III, Barcelona, Spain, 2010.
[14] R. Hoffmann, C. Zhang, and D. S. Weld, "Learning 5000 relational extractors," presented at the
Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, Uppsala,
Sweden, 2010.
[15] O. Frunza, D. Inkpen, and T. Tran, "A Machine Learning Approach for Identifying Disease-
Treatment Relations in Short Texts," IEEE Transactions on Knowledge and Data Engineering, vol.
23, pp. 801-814, 2011.
[16] C. D. Manning, M. Surdeanu, J. Bauer, J. Finkel, S. J. Bethard, and D. McClosky, "The Stanford
CoreNLP Natural Language Processing Toolkit," in ACL Demonstrations, 2014.
[17] B. Rosario and M. A. Hearst, "Classifying semantic relations in bioscience texts," presented at the
Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics, Barcelona,
Spain, 2004.
[18] À. Bravo, J. Piñero, N. Queralt-Rosinach, M. Rautschka, and L. I. Furlong, "Extraction of relations
between genes and diseases from text and large-scale data analysis: implications for translational
research," BMC Bioinformatics, vol. 16, pp. 1-17, 2015.
AUTHORS
Huda Banuqitah is a Teaching Assistant in the Information Technology Department, Faculty of Computing
and Information Technology, King Abdulaziz University, where she is currently a Master's student.
Fathy Essa received the B.Sc. degree in electronics and electrical communication
engineering from Cairo University, Egypt, in 1978; the M.Sc. degree in computers
and systems engineering from Al-Azhar University, Cairo, Egypt, in 1984; and the
Ph.D. degree in computers and systems engineering from Al-Azhar University,
Cairo, Egypt, under joint supervision with the University of Colorado, U.S.A.,
in 1989. He is a full professor in the Computer Science Department, Faculty of
Computing and Information Technology, King Abdulaziz University, Saudi Arabia.
His research interests include agent-based software engineering, cloud computing,
software engineering, big data, distributed systems, and exascale system testing.
Kamal M. Jambi received the B.Sc. degree with honors in Computer Science from the
University of Petroleum and Minerals, KSA, in 1982; the M.Sc. from Michigan State
University, MI, USA, in 1986; and the Ph.D. degree from Illinois Institute of
Technology, IL, U.S.A., in 1991. He is a full professor in the Computer Science
Department, Faculty of Computing and Information Technology, King Abdulaziz
University, Saudi Arabia. His research interests include OCR, NLP, image
processing, software engineering, big data, and distributed systems.
Maysoon Abulkhair is an Assistant Professor and the Supervisor of the IT Department at King Abdulaziz
University, Jeddah, KSA. Her main research interest is HCI, in association with different knowledge
areas such as artificial intelligence, machine learning, and data mining.
i. Unified Medical Language System (UMLS)
ii. Semantic Types of UMLS Concepts
iii. MRREL table description
iv. http://ii.nlm.nih.gov/MMBaseline/
v. https://metamap.nlm.nih.gov
vi. https://metamap.nlm.nih.gov/Docs/2012_MMO.pdf