This summarizes a research paper that proposes a biomedical indexing and retrieval system called BIOINSY. It uses a language-modeling approach to select the best Medical Subject Headings (MeSH) descriptors for indexing medical articles from sources such as PubMed. The system first preprocesses articles by splitting the text, stemming words, and removing stop words. It then extracts terms using a hybrid linguistic and statistical approach. Terms are weighted according to semantic relationships in MeSH rather than statistics alone. Descriptors are selected by disambiguating terms and estimating the probability that a descriptor was generated by the article's language model. Experiments showed the effectiveness of this conceptual indexing approach.
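The core scoring step can be illustrated with a minimal sketch. This is not the paper's actual implementation: the smoothing scheme, the toy article, and the candidate descriptors are all invented for illustration; only the idea of ranking descriptors by the probability the article's unigram language model generated them comes from the summary above.

```python
# Hypothetical sketch: rank candidate MeSH descriptors by the probability
# that the article's unigram language model generated their terms.
from collections import Counter

def article_lm(tokens, vocab, mu=0.1):
    """Unigram language model with additive smoothing over `vocab`."""
    counts = Counter(tokens)
    total, V = len(tokens), len(vocab)
    return {w: (counts[w] + mu) / (total + mu * V) for w in vocab}

def descriptor_score(descriptor, lm):
    """P(descriptor | article LM): product of its term probabilities."""
    score = 1.0
    for term in descriptor.lower().split():
        score *= lm.get(term, min(lm.values()))  # floor for unseen terms
    return score

# toy article and candidate descriptors, invented for this sketch
tokens = "aspirin reduces fever and inflammation in patients".split()
vocab = set(tokens) | {"analgesics", "pain"}
lm = article_lm(tokens, vocab)
candidates = ["Fever", "Analgesics", "Pain"]
ranked = sorted(candidates, key=lambda d: descriptor_score(d, lm), reverse=True)
```

A descriptor whose terms actually occur in the article ("Fever" here) outranks descriptors whose terms do not, which is the intuition behind the conceptual indexing.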
TWO-LEVEL SELF-SUPERVISED RELATION EXTRACTION FROM MEDLINE USING UMLS (IJDKP)
The biomedical research literature is one of many domains that hide precious knowledge, and the biomedical community has made extensive use of this scientific literature to discover facts about biomedical entities such as diseases, drugs, etc. MEDLINE is a huge database of biomedical research papers that remains a significantly underutilized source of biological information. Discovering useful knowledge from such a huge corpus raises various problems related to the type of information involved, such as the concepts belonging to the domain of the texts and the semantic relationships among them. In this paper, we propose a two-level model for self-supervised relation extraction from MEDLINE using the Unified Medical Language System (UMLS) knowledge base. The model uses a self-supervised approach for Relation Extraction (RE) by constructing enhanced training examples using information from UMLS. The model shows better results than current state-of-the-art and naïve approaches.
IJRET: International Journal of Research in Engineering and Technology is an international, peer-reviewed online journal published by eSAT Publishing House for the enhancement of research in various disciplines of Engineering and Technology. The aim and scope of the journal is to provide an academic medium and an important reference for the advancement and dissemination of research results that support high-level learning, teaching and research in the fields of Engineering and Technology. We bring together scientists, academicians, field engineers, scholars and students of related fields of Engineering and Technology.
Ontology-Oriented Concept-Based Clustering (eSAT Journals)
Abstract: Worldwide, health-centre scientists, physicians and patients are accessing, analyzing, integrating and storing massive amounts of digital medical data in different databases. The potential for information retrieval is vast and daunting. The objective of our approach is to separate relevant information from irrelevant information through user-friendly and efficient search algorithms. The traditional solution employs keyword-based search without semantic consideration, so keyword retrieval may return inaccurate and incomplete results. To overcome the problem of retrieving information from such a huge amount of data, a concept-based clustering method grounded in an ontology is needed. In the proposed method, WordNet is integrated to match synonyms for the identified keywords so as to obtain accurate information, and concept-based clustering is performed with the k-means algorithm in accordance with the principles of the ontology, so that the importance of the words in a cluster can be identified. Keywords: ontology, concept-based clustering, k-means algorithm, information retrieval.
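The pipeline the abstract describes can be sketched in a few lines. The tiny synonym map below is an invented stand-in for a WordNet lookup, and the k-means initialization is simplified to be deterministic; it is an illustration of the idea, not the authors' implementation.

```python
# Illustrative sketch: map keyword variants to shared concepts (a stand-in
# for WordNet synonym matching), then cluster concept vectors with a
# minimal k-means.
SYNONYMS = {"tumor": "neoplasm", "cancer": "neoplasm", "neoplasm": "neoplasm",
            "heart": "cardiac", "cardiac": "cardiac"}  # invented toy map

def concept_vector(doc, concepts):
    """Count concept occurrences after synonym normalization."""
    terms = [SYNONYMS.get(w, w) for w in doc.lower().split()]
    return [terms.count(c) for c in concepts]

def kmeans(vectors, k, iters=10):
    centers = [list(v) for v in vectors[:k]]  # simple deterministic init
    labels = [0] * len(vectors)
    for _ in range(iters):
        labels = [min(range(k),
                      key=lambda j: sum((a - b) ** 2
                                        for a, b in zip(v, centers[j])))
                  for v in vectors]
        for j in range(k):
            members = [v for v, l in zip(vectors, labels) if l == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels

docs = ["tumor growth in cancer patients", "cardiac arrest and heart failure",
        "neoplasm staging for cancer", "heart rate and cardiac output"]
concepts = ["neoplasm", "cardiac"]
vectors = [concept_vector(d, concepts) for d in docs]
labels = kmeans(vectors, k=2)
```

Because "tumor" and "cancer" normalize to the same concept, the two oncology documents land in one cluster and the two cardiology documents in the other, even though they share few surface keywords.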
A Semantic Retrieval System for Extracting Relationships from Biological Corpus (ijcsit)
The World Wide Web holds a large amount of varied information, and users searching it do not always find the type of information they expect. Within information extraction, extracting semantic relationships between terms in documents remains a challenge. This paper proposes a system that retrieves documents based on query expansion and tackles the extraction of semantic relationships from biological documents. The system retrieves documents relevant to the input terms and then detects whether a relationship exists. It uses the Boolean model together with pattern recognition, which helps determine the relevant documents and locate the relationship within a biological document. The system constructs a term-relation table that accelerates the relation-extraction step. The proposed method also lets researchers use the system to figure out the relationship between two biological terms from the information available in the biological documents. For the retrieved documents, the system also measures precision and recall.
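The Boolean retrieval and evaluation steps mentioned above can be sketched as follows. The toy corpus and the "gold standard" relevant set are invented for illustration; only the Boolean AND model and the precision/recall definitions come from the abstract.

```python
# A minimal sketch: Boolean AND retrieval over a toy corpus, then
# precision/recall against an assumed set of relevant documents.
docs = {
    1: "protein binds receptor in cell membrane",
    2: "gene expression regulates protein synthesis",
    3: "receptor protein interaction in signaling",
    4: "dna replication in the nucleus",
}

def boolean_and(query_terms):
    """Return ids of documents containing every query term."""
    return {i for i, text in docs.items()
            if all(t in text.split() for t in query_terms)}

retrieved = boolean_and(["protein", "receptor"])  # matches docs 1 and 3
relevant = {1, 3, 4}                              # assumed gold standard
precision = len(retrieved & relevant) / len(retrieved)
recall = len(retrieved & relevant) / len(relevant)
```

Here everything retrieved is relevant (precision 1.0), but one relevant document is missed (recall 2/3), which is the classic trade-off the system's evaluation would report.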
ONTOLOGY-DRIVEN INFORMATION RETRIEVAL FOR HEALTHCARE INFORMATION SYSTEM: ... (IJNSA Journal)
In health research, one of the major tasks is to retrieve and analyze heterogeneous databases containing a single patient's information gathered from a large volume of data over a long period of time. The main objective of this paper is to present our ontology-based information retrieval approach for a clinical information system. We performed a case study in a real-life hospital setting. The results obtained illustrate the feasibility of the proposed approach, which significantly improved the information retrieval process on a large volume of data over a long period, from August 2011 until January 2012.
An Overview on the Use of Data Mining and Linguistics Techniques for Building... (ijcsit)
The use of Online Social Networks (OSNs) such as Facebook and Twitter to exchange and disseminate news and information in real time is becoming more and more popular. Twitter in particular allows the instant dissemination of short messages, in the form of microblogs, to followers. This survey reviews the literature to examine how OSNs, such as the microblogging tool Twitter, can help in the detection of spreading epidemics. The paper highlights significant challenges in the field of Natural Language Processing (NLP) when building microblog-based early disease detection systems. For instance, microblogging data is an unstructured collection of short messages (140 characters in Twitter) with noise and non-standard use of the English language. Hence, research currently explores linguistics to determine the semantics of the text and uses data mining techniques to extract information useful for detecting disease spread. Furthermore, the survey discusses applications and existing OSN-based early disease detection systems, and outlines directions for future research on improving such systems through a combination of linguistic methods, data mining techniques and recommendation systems.
INTEGRATED, RELIABLE AND CLOUD-BASED PERSONAL HEALTH RECORD: A SCOPING REVIEW (hiij)
Personal Health Records (PHRs) have emerged as a way to integrate a patient's health information and give a global view of the patient's status. However, integration is not a trivial feature when dealing with the variety of electronic health systems used by healthcare centers. Access to sensitive PHR information must comply with privacy policies defined by the patient; PHR architecture design should be in accordance with these policies and take advantage of current technology. Cloud computing is a current technology that provides scalability, ubiquity and elasticity. This paper presents a scoping review of PHR systems that achieve three characteristics: integrated, reliable and cloud-based. We found 101 articles addressing those characteristics and identified four main research topics: proposed/developed systems, PHR recommendations for development, system integration and standards, and security and privacy. Integration is tackled with the HL7 CDA standard. Information reliability
Using a Bag of Words for Automatic Medical Image Annotation with a Latent Sem... (ijaia)
We present in this paper a new approach for the automatic annotation of medical images, using a "bag-of-words" representation of the visual content of the medical image combined with tf-idf text descriptors reduced by latent semantic analysis to extract the co-occurrences between terms and visual terms. A medical report consists of a text describing a medical image. First, we index the text and extract all relevant terms using a thesaurus containing MeSH medical concepts. In a second phase, the medical image is indexed by recovering areas of interest that are invariant to changes in scale, light and tilt. To annotate a new medical image, we use the bag-of-words approach to recover its feature vector, and the vector space model to retrieve similar medical images from the training database. The relevance value of an image with respect to the query image is computed with the cosine function. We conclude with an experiment carried out on five types of radiological imaging to evaluate the performance of our medical annotation system. The results showed that our approach works best with images from radiology of the skull.
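The retrieval step described above, cosine similarity over tf-idf vectors in a vector space model, can be sketched briefly. The "visual words" below are invented tokens standing in for quantized image descriptors; the tf-idf weighting and cosine ranking are the standard formulations, not the paper's exact code.

```python
# Hedged sketch: rank "images" (bags of invented visual words) by cosine
# similarity of their tf-idf vectors, as in a vector space model.
import math
from collections import Counter

def tfidf_vectors(docs):
    """tf * log(N/df) vectors over the shared vocabulary."""
    vocab = sorted({w for d in docs for w in d})
    n = len(docs)
    df = {w: sum(w in d for d in docs) for w in vocab}
    vecs = []
    for d in docs:
        tf = Counter(d)
        vecs.append([tf[w] * math.log(n / df[w]) for w in vocab])
    return vecs

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

# each "image" is a bag of visual words (invented tokens)
corpus = [["edge", "blob", "skull"], ["edge", "rib", "thorax"],
          ["skull", "blob", "edge"]]
query = ["skull", "edge", "blob"]
vecs = tfidf_vectors(corpus + [query])
qvec = vecs[-1]
best = max(range(len(corpus)), key=lambda i: cosine(vecs[i], qvec))
```

The query image is matched to the training image sharing its discriminative visual words ("skull", "blob"), while the ubiquitous word "edge" gets zero idf weight and contributes nothing.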
Poster presented at the XIII Brazilian Congress of Health Informatics -2012.
See: http://www.mlhim.org http://gplus.to/MLHIM and http://gplus.to/MLHIMComm for more information about semantic interoperability in healthcare.
#mlhim #semantic_interoperability #health_informatics
The article outlines extensions to the model and algorithm of detection procedures for spyware, which, in particular, presents a potential threat to medical information systems. The approaches and solutions in this paper make it possible to programmatically implement binary file segmentation using the discrete wavelet transform (WT). Unlike existing approaches, the model and algorithm proposed in the article take into account local extrema of the wavelet coefficients (WC). The described solutions allow the analysis of potentially dangerous and spyware files by the number of bytes, as well as the entropy, of the individual segments of the files submitted for analysis.
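The segmentation idea can be sketched under stated assumptions: a one-level Haar transform stands in for the paper's discrete wavelet transform, segment boundaries are cut at local extrema of the detail coefficients, and Shannon entropy is computed per byte segment. The byte sequence and every threshold choice here are invented; this is an illustration of the mechanism, not the article's algorithm.

```python
# Sketch: Haar detail coefficients -> cuts at local extrema -> per-segment
# byte counts and Shannon entropy.
import math

def haar_detail(data):
    """One-level Haar detail coefficients of a byte sequence."""
    return [(data[i] - data[i + 1]) / 2 for i in range(0, len(data) - 1, 2)]

def local_extrema(coeffs):
    """Indices that are strict local maxima or minima."""
    return [i for i in range(1, len(coeffs) - 1)
            if (coeffs[i] > coeffs[i - 1] and coeffs[i] > coeffs[i + 1])
            or (coeffs[i] < coeffs[i - 1] and coeffs[i] < coeffs[i + 1])]

def entropy(segment):
    """Shannon entropy in bits per byte."""
    freq = {b: segment.count(b) / len(segment) for b in set(segment)}
    return -sum(p * math.log2(p) for p in freq.values())

data = bytes([0, 0, 255, 0, 7, 7, 7, 200, 1, 2, 3, 4])  # invented sample
coeffs = haar_detail(data)
cuts = [2 * i for i in local_extrema(coeffs)]   # map back to byte offsets
bounds = [0] + cuts + [len(data)]
segments = [data[a:b] for a, b in zip(bounds, bounds[1:])]
entropies = [entropy(list(s)) for s in segments]
```

A constant run yields entropy 0 while a segment of distinct bytes approaches the maximum, which is the kind of per-segment signature the described analysis would feed to a classifier.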
Towards optimize-ESA for text semantic similarity: A case study of biomedical... (IJECEIAES)
Explicit Semantic Analysis (ESA) is an approach for measuring the semantic relatedness between terms or documents based on their similarities to the documents of a reference corpus, usually Wikipedia. ESA has received tremendous attention in the fields of natural language processing (NLP) and information retrieval. However, ESA uses a huge Wikipedia index matrix in its interpretation step, multiplying a large matrix by a term vector to produce a high-dimensional vector, which makes the interpretation and similarity steps expensive: much time is lost in unnecessary operations. This paper proposes an enhancement of ESA, called optimize-ESA, that reduces the dimension at the interpretation stage by computing semantic similarity within a specific domain. The experimental results show clearly that our method correlates much better with human judgement than the full ESA approach.
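The interpretation step, and the domain restriction that optimize-ESA applies to it, can be sketched with a toy matrix. All association weights and concept names below are invented stand-ins for the Wikipedia index; only the matrix-times-term-vector interpretation and the idea of keeping only a domain's columns come from the abstract.

```python
# Minimal sketch: ESA interprets a term vector by projecting it onto
# concept columns; restricting to a domain's columns shrinks the work.
import math

def interpret(term_weights, matrix, concept_ids):
    """Project a sparse term vector onto the selected concept columns."""
    return [sum(term_weights.get(t, 0.0) * row.get(c, 0.0)
                for t, row in matrix.items())
            for c in concept_ids]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a))
                  * math.sqrt(sum(x * x for x in b)))

# term -> {concept: association weight}, an invented toy index
MATRIX = {
    "gene":    {"Genetics": 0.9, "Medicine": 0.4, "Finance": 0.0},
    "protein": {"Genetics": 0.7, "Medicine": 0.5, "Finance": 0.0},
    "stock":   {"Genetics": 0.0, "Medicine": 0.0, "Finance": 0.9},
}

biomedical = ["Genetics", "Medicine"]   # domain-restricted columns only
v1 = interpret({"gene": 1.0}, MATRIX, biomedical)
v2 = interpret({"protein": 1.0}, MATRIX, biomedical)
sim = cosine(v1, v2)   # semantic relatedness of "gene" and "protein"
```

Dropping the irrelevant "Finance" column changes nothing for biomedical terms but shrinks every multiplication, which is the efficiency argument optimize-ESA makes at full Wikipedia scale.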
Using Data Mining Methods for Knowledge Discovery in Text Mining (eSAT Journals)
Abstract: Many data mining techniques have been proposed for mining useful patterns in text documents. However, how to effectively use and update discovered patterns is still an open research issue, especially in the domain of text mining. Since most existing text mining methods adopt term-based approaches, they all suffer from the problems of polysemy and synonymy. Over the years, people have often held the hypothesis that pattern (or phrase)-based approaches should outperform term-based ones, but many experiments do not support this hypothesis. The proposed work presents an innovative and effective pattern discovery technique, including the processes of pattern deploying and pattern evolving, to improve the effectiveness of using and updating discovered patterns for finding relevant and interesting information. Keywords: text mining, text classification, pattern mining, pattern evolving, information filtering.
CLOUD-BASED DEVELOPMENT OF SMART AND CONNECTED DATA IN HEALTHCARE APPLICATION (ijdpsjournal)
Because data integration is needed in cloud-based systems, we propose an Information Integration and Informatics framework for cloud-based healthcare applications. The data collected by an Electronic Health Record system need to be smart and connected, so we use Informatica to connect data from different databases. Traditional Electronic Health Record systems are based on different technologies, languages and Electronic Health Record standards, and store data based on interactions between patient and provider. Given scalable cloud infrastructures and distributed, heterogeneous healthcare systems, there is a need to develop an advanced healthcare application. This application will improve the integration of the required data and enable fast interaction between patients and service providers, thus developing smart and connected data in a cloud healthcare application. The proposed system is developed on the Aneka cloud platform.
Vertical intent prediction approach based on Doc2vec and convolutional neural... (IJECEIAES)
Vertical selection is the task of selecting the verticals most relevant to a given query in order to improve the diversity and quality of web search results. It requires not only predicting relevant verticals, but predicting the verticals the user expects to be relevant for their particular information need. Most existing work uses traditional machine learning techniques to combine multiple types of features for selecting several relevant verticals. Although these techniques are efficient, handling vertical selection with high accuracy remains a challenging research task. In this paper, we propose an approach for improving vertical selection that satisfies the user's vertical intent and reduces browsing time and effort. First, it generates query embedding vectors using the doc2vec algorithm, which preserves syntactic and semantic information within each query. Second, this vector is used as input to a convolutional neural network model to enrich the representation of the query with multiple levels of abstraction, including rich semantic information, and to create a global summarization of the query features. We demonstrate the effectiveness of our approach through comprehensive experiments on various datasets. Our findings show that the system achieves significant accuracy and makes accurate predictions on new, unseen data.
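The embedding-then-convolution pipeline can be sketched with invented numbers. In a real system the query vector would come from a trained doc2vec model and the filters and output weights from training the CNN; here every weight, the vertical names, and the embedding are made up, and only the conv/ReLU/max-pool/linear structure reflects the abstract.

```python
# Sketch with invented weights: 1-D convolution plus max-pooling over a
# query embedding, then a linear layer scoring candidate verticals.
def conv1d(vec, kernel):
    k = len(kernel)
    return [sum(vec[i + j] * kernel[j] for j in range(k))
            for i in range(len(vec) - k + 1)]

def relu(xs):
    return [max(0.0, x) for x in xs]

query_embedding = [0.2, -0.1, 0.4, 0.3, -0.2, 0.1]  # stand-in doc2vec output
filters = [[1.0, 0.5, -0.5], [-1.0, 1.0, 0.0]]       # invented learned filters
features = [max(relu(conv1d(query_embedding, f))) for f in filters]

# toy linear layer mapping pooled features to two verticals
weights = {"news": [0.3, 0.8], "images": [0.9, -0.2]}
scores = {v: sum(w * x for w, x in zip(ws, features))
          for v, ws in weights.items()}
best_vertical = max(scores, key=scores.get)
```

Max-pooling collapses each filter's responses into one feature regardless of query length, which is the "global summarization of the query features" the abstract refers to.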
NLP-Based Retrieval of Medical Information for Diagnosis of Human Diseases (eSAT Journals)
Abstract: NLP-based retrieval of medical information is the extraction of medical data from narrative clinical documents. In this paper, we provide a way to diagnose diseases with the help of natural language interpretation and classification techniques. Extraction of medical information is a difficult task, however, due to complex symptom names and complex disease names. For diagnosis, we use two approaches: one obtains disease names with the help of classifiers, and the other uses NLP patterns to get the information related to diseases. The two approaches are applied according to the question type. Keywords: NLP, narrative text, extraction, medical information, expert system.
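The second, pattern-based approach can be illustrated with a minimal sketch. The surface patterns and the sample clinical sentence are invented for this example; a real system would use a much richer pattern set and proper tokenization.

```python
# Illustrative sketch: surface patterns that pull disease mentions out of
# narrative clinical text (patterns and sentence are invented examples).
import re

PATTERNS = [
    r"diagnosed with ([a-z ]+?)(?:\.|,| and )",
    r"suffers from ([a-z ]+?)(?:\.|,| and )",
    r"history of ([a-z ]+?)(?:\.|,| and )",
]

def extract_diseases(text):
    """Apply each pattern and collect the captured disease phrases."""
    found = []
    for pat in PATTERNS:
        found += [m.strip() for m in re.findall(pat, text.lower())]
    return found

note = "Patient was diagnosed with diabetes, and has a history of hypertension."
```

Such patterns are brittle against the complex symptom and disease names the abstract mentions, which is exactly why the paper pairs them with a classifier-based approach.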
International Journal of Computational Engineering Research (IJCER) (ijceronline)
International Journal of Computational Engineering Research (IJCER) is an international, English-language, monthly online journal. It publishes original research work that contributes significantly to furthering scientific knowledge in engineering and technology.
GReAT Model: a model for the automatic generation of semantic relations betwee... (ijcsity)
The large available amount of non-structured texts that belong to different domains such as healthcare (e.g. medical records), justice (e.g. laws, declarations), insurance (e.g. declarations), etc. increases the effort required for the analysis of information in a decision-making process. Different projects and tools have proposed strategies to reduce this complexity by classifying, summarizing or annotating the texts. Particularly, text summary strategies have proven to be very useful to provide a compact view of an original text. However, the available strategies to generate these summaries do not fit very well within domains that require taking into consideration the temporal dimension of the text (e.g. a recent piece of text in a medical record is more important than a previous one) and the profile of the person who requires the summary (e.g. the medical specialization). To cope with these limitations, this paper presents "GReAT", a model for automatic summary generation that relies on natural language processing and text mining techniques to extract the most relevant information from narrative texts and discover new information from the detection of related information. The GReAT Model was implemented in software to be validated in a health institution, where it has shown to be very useful to display a preview of the information in medical health records and to discover new facts and hypotheses within the information. Several tests were executed, covering Functionality, Usability and Performance of the implemented software. In addition, precision and recall measures were applied to the results obtained through the implemented tool, as well as to the loss of information caused by providing a text shorter than the original.
INTEGRATED, RELIABLE AND CLOUD-BASED PERSONAL HEALTH RECORD: A SCOPING REVIEWhiij
Personal Health Records (PHR) emerge as an alternative to integrate patient’s health information to give a
global view of patients' status. However, integration is not a trivial feature when dealing with a variety
electronic health systems from healthcare centers. Access to PHR sensitive information must comply with
privacy policies defined by the patient. Architecture PHR design should be in accordance to these, and take
advantage of nowadays technology. Cloud computing is a current technology that provides scalability,
ubiquity, and elasticity features. This paper presents a scoping review related to PHR systems that achieve
three characteristics: integrated, reliable and cloud-based. We found 101 articles that addressed
thosecharacteristics. We identified four main research topics: proposal/developed systems, PHR
recommendations for development, system integration and standards, and security and privacy. Integration
is tackled with HL7 CDA standard. Information reliability
Using a Bag of Words for Automatic Medical Image Annotation with a Latent Sem...ijaia
We present in this paper a new approach for the automatic annotation of medical images, using the
approach of "bag-of-words" to represent the visual content of the medical image combined with text
descriptors based approach tf.idf and reduced by latent semantic to extract the co-occurrence between
terms and visual terms. A medical report is composed of a text describing a medical image. First, we are
interested to index the text and extract all relevant terms using a thesaurus containing MeSH medical
concepts. In a second phase, the medical image is indexed while recovering areas of interest which are
invariant to change in scale, light and tilt. To annotate a new medical image, we use the approach of "bagof-
words" to recover the feature vector. Indeed, we use the vector space model to retrieve similar medical
image from the database training. The calculation of the relevance value of an image to the query image is
based on the cosine function. We conclude with an experiment carried out on five types of radiological
imaging to evaluate the performance of our system of medical annotation. The results showed that our
approach works better with more images from the radiology of the skull.
Poster presented at the XIII Brazilian Congress of Health Informatics -2012.
See: http://www.mlhim.org http://gplus.to/MLHIM and http://gplus.to/MLHIMComm for more information about semantic interoperability in healthcare.
#mlhim #semantic_interoperability #health_informatics
The article outlines extensions to the model and algorithm of spyware detection
procedures which, in particular, presents a potential threat to medical information
systems. The approaches and solutions in this paper allow to programmatically
implement binary file segmentation using discrete wavelet transform (WT). Unlike the
existing approaches, the model and algorithm proposed in the article took into
account local extrema of the wavelet coefficients (WC). The described solutions allow
for analysis of potentially dangerous and spyware files the number of bytes, as well as
the entropy for individual segments of files that are transmitted for analysis
Towards optimize-ESA for text semantic similarity: A case study of biomedical...IJECEIAES
Explicit Semantic Analysis (ESA) is an approach to measure the semantic relatedness between terms or documents based on similarities to documents of a references corpus usually Wikipedia. ESA usage has received tremendous attention in the field of natural language processing NLP and information retrieval. However, ESA utilizes a huge Wikipedia index matrix in its interpretation by multiplying a large matrix by a term vector to produce a high-dimensional vector. Consequently, the ESA process is too expensive in interpretation and similarity steps. Therefore, the efficiency of ESA will slow down because we lose a lot of time in unnecessary operations. This paper propose enhancements to ESA called optimize-ESA that reduce the dimension at the interpretation stage by computing the semantic similarity in a specific domain. The experimental results show clearly that our method correlates much better with human judgement than the full version ESA approach.
Using data mining methods knowledge discovery for text miningeSAT Journals
Abstract Many data mining techniques have been proposed for mining useful patterns in text documents. However, how to effectively use and update discovered patterns is still an open research issue, especially in the domain of text mining. Since most existing text mining methods adopted term-based approaches, they all suffer from the problems of polysemy and synonymy. Over the years, people have often held the hypothesis that pattern (or phrase)-based approaches should perform better than the term-based ones, but many experiments do not support this hypothesis. Proposed work presents an innovative and effective pattern discovery technique which includes the processes of pattern deploying and pattern evolving, to improve the effectiveness of using and updating discovered patterns for finding relevant and interesting information. Keywords:-Text mining, text classification, pattern mining, pattern evolving, information filtering.
CLOUD-BASED DEVELOPMENT OF SMART AND CONNECTED DATA IN HEALTHCARE APPLICATIONijdpsjournal
There is a need of data integration in cloud – based system, we propose an Information Integration and Informatics framework for cloud – based healthcare application. The data collected by the Electronic Health Record System need to be smart and connected, so we use informatica for the connection of data
from different database. Traditional Electronic Health Record Systems are based on different technologies, languages and Electronic Health Record Standards. Electronic Health Record System stores data based on interaction between patient and provider. There are scalable cloud infrastructures, distributed and heterogeneous healthcare systems and there is a need to develop advanced healthcare application. This advance healthcare application will improve the integration of required data and helps in fast interaction between the patient and the service providers. Thus there is the development of smart
and connected data in healthcare application of cloud. The proposed system is developed by using cloud platform Aneka.
Vertical intent prediction approach based on Doc2vec and convolutional neural...IJECEIAES
Vertical selection is the task of selecting the most relevant verticals to a given query in order to improve the diversity and quality of web search results. This task requires not only predicting relevant verticals but also these verticals must be those the user expects to be relevant for his particular information need. Most existing works focused on using traditional machine learning techniques to combine multiple types of features for selecting several relevant verticals. Although these techniques are very efficient, handling vertical selection with high accuracy is still a challenging research task. In this paper, we propose an approach for improving vertical selection in order to satisfy the user vertical intent and reduce user’s browsing time and efforts. First, it generates query embeddings vectors using the doc2vec algorithm that preserves syntactic and semantic information within each query. Secondly, this vector will be used as input to a convolutional neural network model for increasing the representation of the query with multiple levels of abstraction including rich semantic information and then creating a global summarization of the query features. We demonstrate the effectiveness of our approach through comprehensive experimentation using various datasets. Our experimental findings show that our system achieves significant accuracy. Further, it realizes accurate predictions on new unseen data.
Nlp based retrieval of medical information for diagnosis of human diseaseseSAT Journals
Abstract: NLP-based retrieval of medical information is the extraction of medical data from narrative clinical documents. In this paper, we provide a way to diagnose diseases with the help of natural language interpretation and classification techniques. However, extraction of medical information is a difficult task due to complex symptom names and complex disease names. For diagnosis we use two approaches: one obtains disease names with the help of classifiers, and the other uses NLP patterns to obtain information related to diseases. The two approaches are applied according to the question type. Keywords: NLP, narrative text, extraction, medical information, expert system
International Journal of Computational Engineering Research(IJCER)ijceronline
International Journal of Computational Engineering Research (IJCER) is an international, English-language monthly online journal. The journal publishes original research work that contributes significantly to furthering scientific knowledge in engineering and technology.
Great model a model for the automatic generation of semantic relations betwee...ijcsity
The large available amount of non-structured texts that belong to different domains such as healthcare (e.g. medical records), justice (e.g. laws, declarations), insurance (e.g. declarations), etc. increases the effort required for the analysis of information in a decision-making process. Different projects and tools have proposed strategies to reduce this complexity by classifying, summarizing or annotating the texts. Particularly, text summary strategies have proven to be very useful to provide a compact view of an original text. However, the available strategies to generate these summaries do not fit very well within the domains that require taking into consideration the temporal dimension of the text (e.g. a recent piece of text in a medical record is more important than a previous one) and the profile of the person who requires the summary (e.g. the medical specialization). To cope with these limitations this paper presents "GReAT", a model for automatic summary generation that relies on natural language processing and text mining techniques to extract the most relevant information from narrative texts and discover new information from the detection of related information. The GReAT model was implemented in software to be validated in a health institution, where it has shown to be very useful to display a preview of the information about medical health records and to discover new facts and hypotheses within the information. Several tests were executed, such as Functionality, Usability and Performance tests of the implemented software. In addition, precision and recall measures were applied to the results obtained through the implemented tool, as well as to the loss of information incurred by providing a text shorter than the original.
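The temporal weighting GReAT argues for can be illustrated with a minimal sketch: score each record by how representative its terms are of the whole collection, then discount older records. The scoring function and decay factor below are illustrative assumptions, not the model actually used by GReAT.

```python
from collections import Counter

def summarize(records, k=2, decay=0.8):
    """Pick the k highest-scoring records; records are assumed ordered
    oldest-first, and older records are discounted by `decay` per step."""
    docs = [r.lower().split() for r in records]
    vocab = Counter(w for d in docs for w in d)
    n = len(docs)
    scored = []
    for i, d in enumerate(docs):
        # Content score: sum of collection frequencies of the record's terms.
        content = sum(vocab[w] for w in set(d))
        # Recency weight: the newest record (last in the list) gets weight 1.0.
        recency = decay ** (n - 1 - i)
        scored.append((content * recency, records[i]))
    scored.sort(key=lambda t: -t[0])
    return [r for _, r in scored[:k]]
```

With the decay at 0.8, a slightly less term-rich but newer record can outrank an older one, which is exactly the behaviour the abstract motivates for medical records.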
Domain ontology development for communicable diseasescsandit
The Web has become the very first resource to search for any kind of information. With the
emergence of the semantic web, our search queries have started generating more informed results.
Ontologies are at the core of any semantic web application. They help in the rapid development of
distributed systems by providing information on the fly. This key feature of distributing and
sharing information has made ontologies a new knowledge representation mechanism, one
strongly backed by a sound inference system. In this paper, we discuss
the development, verification and validation of an ontology in a health domain.
Chapter 15 Information Retrieval from Medical Knowledge Resourc.docxketurahhazelhurst
Chapter 15: Information Retrieval from Medical Knowledge Resources
WILLIAM R. HERSH
Learning Objectives
After viewing this presentation, viewers should be able to:
Enumerate the basic biomedical and health knowledge resources in books, journals, electronic databases, and other sources
Describe the major approaches used to indexing knowledge-based content
Apply advanced searching techniques to the major biomedical and health knowledge resources
Discuss the major results of information retrieval evaluation studies
Describe future directions for research in information retrieval
Introduction
Information Retrieval (IR), sometimes called search, concerns the acquisition, organization, and searching of knowledge-based information, which is usually defined as information derived and organized from observational or experimental research
Although IR in biomedicine traditionally concentrated on the retrieval of text from the biomedical literature, the study has expanded to include newer types of media that include images, video, chemical structures, gene and protein sequences, and a wide range of other digital media of relevance to biomedical education, research, and patient care
Introduction
The overall goal of the IR process is to find content that meets a person’s information needs
Components of information retrieval systems
IR tends to focus on knowledge-based information
Knowledge-based information categories:
Primary knowledge–based information (also called primary literature) is original research that appears in journals, books, reports, and other sources
Secondary knowledge–based information consists of the writing that reviews, condenses, and/or synthesizes the primary literature. The most common examples of this type of literature are books, monographs, and review articles in journals and other publications
Knowledge Based Information
Virtually all scientific journals are published electronically
Not only is there the increased convenience of redistributing articles, but research has found that articles freely available on the Web have a higher likelihood of being cited by other papers than those that are not (Bork 2012)
Printing and mailing, tasks no longer needed in electronic publishing, comprised a significant part of the “added value” from publishers of journals. There is still however value added by publishers, such as hiring and managing editorial staff to produce the journals and managing the peer review process
Publication of Knowledge-Based Information
The basic principle of open access publishing is that authors and/or their institutions pay the cost of production of manuscripts up front after they are accepted through a peer review process. After the paper is published, it becomes freely available on the Web. Since most research is usually funded by grants, the cost of open access publishing should be included in grant budgets. The uptake of publishers adhering to ...
ONTOLOGY-DRIVEN INFORMATION RETRIEVAL FOR HEALTHCARE INFORMATION SYSTEM : A C...IJNSA Journal
In health research, one of the major tasks is to retrieve and analyze heterogeneous databases containing a single patient's information gathered from a large volume of data over a long period of time. The main objective of this paper is to present our ontology-based information retrieval approach for a clinical information system. We performed a case study in a real-life hospital setting. The results obtained illustrate the feasibility of the proposed approach, which significantly improved the information retrieval process on a large volume of data collected over a long period, from August 2011 until January 2012.
Visual Analytics and the Language of Web Query Logs - A Terminology PerspectiveFindwise
This paper explores means to integrate natural language processing methods for terminology and entity identification in medical web session logs with visual analytics techniques. The aim of the study is to examine whether the vocabulary used in queries posted to a Swedish regional health web site can be assessed in a way that enables a terminologist or medical data analyst to instantly identify new term candidates and their relations based on significant co-occurrence patterns. We provide an example application to illustrate how the co-occurrence relationships between medical and general entities occurring in such logs can be visualized, accessed and explored. To enable a visual exploration of the generated co-occurrence graphs, we employ a general-purpose social network analysis tool, visone (http://visone.info), that permits visualizing and analyzing various types of graph structures. Our examples show that visual analytics based on co-occurrence analysis provides insights into the use of layman language in relation to established (professional) terminologies, which may help terminologists decide which terms to include in future terminologies. Increased understanding of the querying language used is also of interest in the context of public health web sites. The query results should reflect the intentions of the information seekers, who may express themselves in layman language that differs from the one used on the available web sites provided by medical professionals.
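The co-occurrence analysis described above can be approximated in a few lines: count how often two terms appear in the same query-log session and keep the edges above a frequency threshold as the graph to visualize. This sketch omits the significance testing the paper relies on; the threshold and the toy sessions are assumptions.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_edges(sessions, min_count=2):
    """Build term co-occurrence counts from query-log sessions.
    Each session is a list of terms; an undirected edge links two
    terms that appear together in the same session."""
    edges = Counter()
    for terms in sessions:
        # Sorting makes the pair canonical so (a, b) and (b, a) merge.
        for a, b in combinations(sorted(set(terms)), 2):
            edges[(a, b)] += 1
    # Keep only edges frequent enough to be worth plotting.
    return {e: c for e, c in edges.items() if c >= min_count}
```

The resulting edge dictionary maps directly onto the node-link graphs a tool such as visone displays.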
Evaluating Semantic Similarity between Biomedical Concepts/Classes through S...Editor IJCATR
Most of the existing semantic similarity measures that use ontology structure as their primary source can measure semantic similarity between concepts/classes using single ontology. The ontology-based semantic similarity techniques such as structure-based semantic similarity techniques (Path Length Measure, Wu and Palmer’s Measure, and Leacock and Chodorow’s measure), information content-based similarity techniques (Resnik’s measure, Lin’s measure), and biomedical domain ontology techniques (Al-Mubaid and Nguyen’s measure (SimDist)) were evaluated relative to human experts’ ratings, and compared on sets of concepts using the ICD-10 “V1.0” terminology within the UMLS. The experimental results validate the efficiency of the SemDist technique in single ontology, and demonstrate that SemDist semantic similarity techniques, compared with the existing techniques, gives the best overall results of correlation with experts’ ratings.
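As a concrete illustration of one of the structure-based measures evaluated here, the following is a minimal Wu and Palmer implementation over a toy is-a taxonomy. The taxonomy and node names are invented for the example; real evaluations run over UMLS/ICD-10 hierarchies.

```python
def ancestors(taxonomy, node):
    """Chain from a node up to the root; taxonomy maps node -> parent (root -> None)."""
    chain = [node]
    while taxonomy[node] is not None:
        node = taxonomy[node]
        chain.append(node)
    return chain

def depth(taxonomy, node):
    """Number of edges from the root down to the node."""
    return len(ancestors(taxonomy, node)) - 1

def wu_palmer(taxonomy, a, b):
    """Wu & Palmer: 2*depth(LCS) / (depth(a) + depth(b)),
    with depth counted in nodes from the root (root = 1)."""
    anc_a = ancestors(taxonomy, a)
    anc_b = set(ancestors(taxonomy, b))
    lcs = next(n for n in anc_a if n in anc_b)  # lowest common subsumer
    da, db, dl = (depth(taxonomy, n) + 1 for n in (a, b, lcs))
    return 2 * dl / (da + db)
```

Identical concepts score 1.0, and similarity falls as the lowest common subsumer moves closer to the root, which is the intuition behind all the path-based measures compared in the paper.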
TOP K-OPINION DECISIONS RETRIEVAL IN HEALTHCARE SYSTEM csandit
The aim of this paper is to apply data mining techniques and opinion mining (OM) concepts to the field of health informatics. Decision making in health informatics involves a number of opinions given by a group of medical experts for a specific disease in the form of decision-based opinions, which are presented as text in a medical database. These decision-based opinions are then mined from the database with a mining technique. Text document clustering plays a major role in the fast-developing information explosion and is considered a tool for performing information-based operations. Text document clustering generates clusters from the whole document collection automatically; normally the K-means clustering technique is used for text document clustering. In this paper we use the Bisecting K-means clustering technique, which performs better than the traditional K-means technique. The objective is to study the revealed groupings of similar opinion types associated with the likelihood of physicians and medical experts.
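Bisecting K-means, as used above, starts from one cluster and repeatedly splits a chosen cluster with plain 2-means. The sketch below works on 1-D points and splits the largest cluster; production implementations usually split the cluster with the highest squared error and operate on TF-IDF document vectors, so treat this as a structural illustration only.

```python
import random

def two_means(points, iters=10, seed=0):
    """Basic 2-means split used as the inner step of bisecting K-means."""
    rng = random.Random(seed)
    c1, c2 = rng.sample(points, 2)
    for _ in range(iters):
        g1 = [p for p in points if abs(p - c1) <= abs(p - c2)]
        g2 = [p for p in points if abs(p - c1) > abs(p - c2)]
        if g1: c1 = sum(g1) / len(g1)
        if g2: c2 = sum(g2) / len(g2)
    return g1, g2

def bisecting_kmeans(points, k):
    """Repeatedly split the largest cluster until k clusters remain.
    (Choosing the largest cluster is a simplification; the usual
    criterion is the cluster with the highest SSE.)"""
    clusters = [list(points)]
    while len(clusters) < k:
        clusters.sort(key=len, reverse=True)
        target = clusters.pop(0)
        g1, g2 = two_means(target)
        clusters += [g1, g2]
    return clusters
```

Because each split only runs 2-means, the method avoids the sensitivity of flat K-means to a bad simultaneous initialization of all k centroids.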
The World Wide Web holds a large amount of varied information, and while searching it, users do not always find the type of information they expect. In the field of information extraction, extracting semantic relationships between terms in documents is a challenge. This paper proposes a system that helps retrieve documents based on query expansion and tackles the extraction of semantic relationships from biological documents. The system retrieves documents relevant to the input terms and then extracts any relationship present. It uses the Boolean model and pattern matching, which help determine the relevant documents and locate the relationship in the biological document. The system constructs a term-relation table that accelerates the relation-extraction step. The proposed method also lets researchers figure out the relationship between two biological terms from the information available in biological documents. For the retrieved documents, the system measures precision and recall.
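The retrieval-then-extraction pipeline can be sketched as follows: a Boolean AND retrieves candidate documents, a surface pattern locates the relation word between the two terms, and the results are cached in a term-relation table. The toy corpus, the single verb pattern and the table layout are illustrative assumptions, not the paper's actual data or patterns.

```python
import re

# Toy corpus of biological sentences (invented for the example).
docs = [
    "TP53 activates BAX in stressed cells",
    "BCL2 inhibits BAX during apoptosis",
    "BAX expression varies across tissues",
]

def boolean_and(docs, terms):
    """Boolean model: a document is relevant iff it contains all terms."""
    return [d for d in docs if all(t.lower() in d.lower() for t in terms)]

def extract_relation(doc, a, b):
    """A single word between the two terms signals a candidate relation."""
    m = re.search(rf"{a}\s+(\w+)\s+{b}", doc, re.IGNORECASE)
    return m.group(1) if m else None

def term_relation_table(docs, pairs):
    """Precompute a term-relation table so later lookups skip re-scanning."""
    table = {}
    for a, b in pairs:
        for d in boolean_and(docs, [a, b]):
            rel = extract_relation(d, a, b)
            if rel:
                table[(a, b)] = rel
    return table
```

A researcher asking how two terms relate then reads the answer straight out of the table instead of re-running retrieval and pattern matching.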
Semantic Similarity Measures between Terms in the Biomedical Domain within f...Editor IJCATR
Techniques and tests are tools used to define how to measure the goodness of an ontology or its resources. Measuring the similarity between biomedical classes/concepts is an important task for biomedical information extraction and knowledge discovery, and most semantic similarity techniques can be adopted for use in the biomedical domain (UMLS). Many experiments have been conducted to check the applicability of these measures. In this paper, we investigate measuring semantic similarity between two terms within a single ontology or multiple ontologies in ICD-10 “V1.0” as the primary source, and compare the results to human experts' scores by correlation coefficient.
A PROPOSED MULTI-DOMAIN APPROACH FOR AUTOMATIC CLASSIFICATION OF TEXT DOCUMENTSijsc
Classification is an important technique used in information retrieval. Supervised classification suffers from certain limitations concerning the collection and labeling of the training dataset. When facing multi-domain classification, multiple training datasets and classifiers are needed, which is relatively difficult. In this paper an unsupervised classification system is proposed that can also manage the multi-domain classification problem. It is a multi-domain system where each domain is represented by an ontology. A document is mapped on each ontology based on the weights of the mutual tokens between them with the help of fuzzy sets, resulting in a mapping degree of the document with each domain. An experiment was carried out showing satisfactory classification results, with an improvement in the evaluation results of the proposed system compared to Apache Lucene.
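The fuzzy mapping degree idea can be sketched with a simple membership function: the fraction of a document's tokens that occur in each domain ontology, with the highest degree deciding the domain. The paper's actual token weights and fuzzy sets are richer; uniform weights and the toy ontologies below are assumptions.

```python
def mapping_degree(doc_tokens, ontology_tokens):
    """Share of the document's tokens found in the ontology, in [0, 1]."""
    doc = [t.lower() for t in doc_tokens]
    onto = {t.lower() for t in ontology_tokens}
    if not doc:
        return 0.0
    return sum(1 for t in doc if t in onto) / len(doc)

def classify(doc_tokens, ontologies):
    """Map the document on each domain ontology and pick the highest degree."""
    degrees = {name: mapping_degree(doc_tokens, toks)
               for name, toks in ontologies.items()}
    return max(degrees, key=degrees.get), degrees
```

Because every domain gets a degree rather than a hard yes/no, a document can legitimately belong to several domains at once, which is the point of the fuzzy formulation.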
Biomedical indexing and retrieval system based on language modeling approach
1. International Journal of Software Engineering & Applications (IJSEA), Vol.3, No.3, May 2012
DOI : 10.5121/ijsea.2012.3306 61
Biomedical indexing and retrieval system based on
language modeling approach
Jihen MAJDOUBI, Hatem LOUKI, Mohamed TMAR and Faiez GARGOURI
Multimedia InfoRmation system and Advanced Computing Laboratory, Higher Institute
of Information Technology and Multimedia, University of Sfax, Tunisia
majdoubi_jihene@yahoo.fr,hatem.loukil@isimsf.rnu.tn,
mohamed.tmar@isimsf.rnu.tn, faiez.gargouri@fsegs.rnu.tn
ABSTRACT
In the medical field, scientific articles represent a very important source of knowledge for researchers
in this domain. But due to the large volume of scientific articles published on the web, efficient detection
and use of this knowledge is quite a difficult task.
In this paper, we propose our contribution for conceptual indexing of medical articles by using the MeSH
(Medical Subject Headings) thesaurus. With this in mind, we propose a tool for indexing medical articles
called BIOINSY (BIOmedical Indexing SYstem) which uses a language model for selecting the best
representative descriptors for each document.
The proposed indexing approach was evaluated through intensive experiments. These experiments were
conducted on test collections of real-world clinical documents extracted from scientific collections, namely
PUBMED and CLEF. The results of these experiments demonstrate the effectiveness of our
indexing approach.
KEYWORDS
Medical article, conceptual indexing, Language models, MeSH thesaurus, Biomedical Information
Retrieval
1. INTRODUCTION
The goal of an Information Retrieval System (IRS) is to retrieve information relevant to a user's
query. This goal is quite a difficult task given the rapid and continuing growth of the Internet.
Indeed, web information retrieval becomes more and more complex for the user, to whom the IRS
provides a lot of information; the user often fails to find the best information in the context of his
need. Current IRS rely on simply matching keywords from queries to those from documents,
under the basic assumption of term independence. A document to be returned to the user should
contain at least one word of the query. However, a document can be relevant even if it does not
contain any word of the query. As a simple example, if the query is about “operating system”, a
document containing Windows, Unix, Vista, but not the term “operating system”, would not be
retrieved by classical search engines. Consequently, most IRS provide poor search results and the
recall is often low.
A suitable solution to this problem is the conceptual indexing process, which uses an ontology's
concepts as a means of normalizing the document vocabulary. Thus, by using the concepts of the
semantic resource (SR) and their descriptions, the IRS becomes able to understand the meaning of
words and capture the document's semantic content. As in the “operating system” example cited
above, through concept recognition the IRS can detect the relationships between the terms of the
user's query, and consequently return the document that mentions Windows as an answer to the
query about “operating system”.
We focus here on the use of the SR to derive the semantic kernels of biomedical documents. In
this paper, we propose a novel method for conceptual indexing of medical articles by using the
MeSH (Medical Subject Headings) resource. The goal of this work is two-fold: first, to build a system which can annotate a medical article with relevant MeSH descriptors; second, to use this automatic annotation method to improve biomedical document retrieval.
The remainder of this paper is organized as follows. After summarizing the background of this problem in the next section, we present the previous work on indexing medical articles in section 3. We detail our conceptual indexing approach in section 4. An experimental evaluation and comparison results are discussed in sections 5 and 6. After that, we apply our conceptual indexing approach to improve document retrieval by using the ImageCLEFmed 2007 test collection. Finally, section 8 presents some conclusions and future work directions.
2. BACKGROUND
2.1 Context
Each year, the rate of publication of scientific literature grows, making it increasingly harder for researchers to keep up with novel relevant published work. In recent years, several research efforts in many fields have attempted to manage this huge volume of information effectively.
In the medical field, scientific articles represent a very important source of knowledge for researchers of this domain. A researcher usually needs to deal with a large amount of scientific and technical articles for checking, validating and enriching his research work.
This kind of information is often available in electronic biomedical resources accessible through the Internet, such as CISMEF¹ and PUBMED². However, the effort that the user puts into the search is often forgotten and lost.
To solve these issues, current health Information Systems must take advantage of recent advances
in knowledge representation and management areas such as the use of medical terminology
resources. Indeed, these resources aim at establishing the representations of knowledge through
which the computer can handle the semantic information.
2.2 Examples of medical terminology resources
The language of biomedical texts, like all natural language, is complex and poses problems of
synonymy and polysemy. Therefore, several terminological resources are available to cope with
the lexical ambiguity and synonymy present in biomedical terminology.
¹ http://www.chu-rouen.fr/cismef/
² http://www.ncbi.nlm.nih.gov/pubmed/
3. International Journal of Software Engineering & Applications (IJSEA), Vol.3, No.3, May 2012
63
In this section, we present some examples of medical terminology resources:
• SNOMED is a coding system, controlled vocabulary, classification system and thesaurus.
It is a comprehensive clinical terminology designed to capture information about a
patient’s history, illnesses, treatment and outcomes.
• Galen (General Architecture for Language and Nomenclatures)³ is a system dedicated to the development of ontologies in all medical domains, including surgical procedures.
• The Gene Ontology is a controlled vocabulary that covers three domains:
– Cellular component, the parts of a cell or its extracellular environment,
– Molecular function, the elemental activities of a gene product at the molecular level, such as binding or catalysis,
– Biological process, operations or sets of molecular events.
• The Unified Medical Language System (UMLS) project was initiated in 1986 by the U.S. National Library of Medicine (NLM). It consists of (1) a metathesaurus, which collects millions of terms belonging to nomenclatures and terminologies defined in the biomedical domain, and (2) a semantic network, which consists of 135 semantic types and 54 relationships.
• The Medical Subject Headings (MeSH) thesaurus⁴ is a controlled vocabulary produced by the National Library of Medicine (NLM) and used for indexing and searching biomedical and health-related information and documents.
For our part, we have chosen MeSH because it meets the aims of medical librarians and is a successful and widely used tool for indexing the literature.
3. RELATED RESEARCH WORK
Automatic indexing of medical articles has been investigated by several researchers. In this section, we are only interested in the indexing approaches using the MeSH thesaurus.
In [2], Névéol proposes a tool called MAIF (MesH Automatic Indexer for French) which is
developed within the CISMeF team. To index a medical resource, MAIF follows three steps:
analysis of the resource to be indexed, translation of the emerging concepts into the appropriate
controlled vocabulary (MeSH thesaurus) and revision of the resulting index.
In [4], the authors proposed the MTI (MeSH Terminology Indexer) to index English resources.
MTI results from the combination of two MeSH Indexing methods: MetaMap Indexing (MMI)
and a statistical, knowledge-based approach called PubMed Related Citations (PRC).
The MMI method [5] consists in discovering the Unified Medical Language System (UMLS) concepts present in the text. These UMLS concepts are then refined into MeSH terms.
The PRC method [6] computes a ranked list of MeSH terms for a given title and abstract by
finding the MEDLINE citations most closely related to the text based on the words shared by
³ http://www.opengalen.org
⁴ http://www.nlm.nih.gov/mesh/
4. International Journal of Software Engineering & Applications (IJSEA), Vol.3, No.3, May 2012
64
both representations. Then, MTI combines the results of both methods through a specific post-processing task to obtain a first list. This list is then submitted to a set of rules designed to filter out irrelevant concepts. To do so, MTI provides three levels of filtering, depending on the desired precision and recall: strict filtering, medium filtering and base filtering.
Nomindex [1] recognizes concepts in a sentence and uses them to create a database for retrieving documents. Nomindex uses a lexicon derived from the ADM (Assisted Medical Diagnosis) [7], which contains 130,000 terms. First, document words are mapped to ADM terms
and reduced to reference words. Then, ADM terms are mapped to the equivalent French MeSH terms, and also to their UMLS Concept Unique Identifiers. Each reference word of the document is then associated with its corresponding UMLS concept. Finally, a relevance score is computed for each concept extracted from the document.
The conceptual indexing strategy proposed by [3] involves three steps. First, the similarity between each MeSH concept C and the document D is computed. Then, the candidate concepts extracted in step 1 are re-ranked according to a correlation measure that estimates how strongly the word order of a MeSH entry is correlated with the order of words in the document. Finally, the content-based similarity and the correlation between the concept C and the document D are combined to compute an overall relevance score. The N top-ranked concepts with the highest scores are selected as candidate concepts of the document D.
[32] introduced a retrieval-based system for MeSH Classification called EAGL. For each MeSH
term, its synonyms and description are indexed as a single document in a retrieval index. A piece
of text, the query to the retrieval system, is classified with the best ranked MeSH ‘documents’.
The KNN (K-Nearest-Neighbours) indexing system [33] relies on a retrieval approach based on language models. The parameters of the query language model are estimated on the text to index. Next, the citations most similar to this query language model are retrieved. The classification is based on the MeSH terms assigned to the top K⁵ retrieved documents: the relevance of a MeSH term is determined by summing the retrieval scores of the top documents that have been assigned that term.
Our indexing approach presented in this paper differs from previous works in the following key
points:
– Term weighting: to our knowledge, there is so far no work dealing with the use of semantic relationships in the weighting process for biomedical IR. Indeed, to determine a term's importance, existing indexing approaches rely on statistical measures (tf, idf), so the semantics of the term are not taken into account in the calculation of its weight. For example, the term weight is calculated independently of the occurrences of its synonyms. However, [8] showed the utility of integrating WordNet's conceptual information to calculate term importance for web information retrieval. Motivated by the results of [8], we estimate the relevance of a term for a document by using its semantic relationships: to determine term importance, our indexing strategy exploits the term's relationships in MeSH, rather than relying only on a statistical measure.
– Word Sense Disambiguation (WSD) technique: WSD consists in recognizing and assigning the correct sense or meaning of a given word. In recent years, many investigations on WSD have been carried out. They can be subdivided into three main categories: knowledge-based [9][10], supervised [11][12] and unsupervised methods [13].
⁵ Based on preceding experiments, K = 10 was used.
5. International Journal of Software Engineering & Applications (IJSEA), Vol.3, No.3, May 2012
65
Knowledge-based approaches are based on the semantic resources such as ontologies. Supervised
methods use manually annotated corpus for training classifiers. Unsupervised classifiers are
trained on unannotated corpora to extract several groups of similar texts.
In our approach, to disambiguate the senses of a term and determine its descriptor in the context of the document, we use the language modeling approach. More precisely, to assess the relevance of a MeSH descriptor to a document, we estimate the probability that the descriptor would have been generated by the language model of this document.
4. OUR CONCEPTUAL INDEXING APPROACH
Our indexing methodology, as schematized in Figure 1, consists of four main steps: (a) pretreatment, (b) term extraction, (c) term weighting and (d) selection of descriptors. In the following, we describe the structure of the MeSH vocabulary and then detail the steps of our indexing method.
Figure 1. Architecture of our indexing approach
4.1. MeSH thesaurus
The structure of MeSH is centred on descriptors, concepts, and terms.
• Each term can be either a simple or a composed term.
• A concept is viewed as a class of synonymous terms; the preferred term gives its name to the concept.
• A descriptor class consists of one or more concepts that are closely related to each other in meaning. Each descriptor has a preferred concept, whose name is the name of the descriptor.
(Figure 1 depicts the pipeline: the corpus and the MeSH thesaurus are first pretreated with NLP tools; simple and composed MeSH terms are then extracted and weighted; since each MeSH term may belong to more than one descriptor, terms are disambiguated using the language modeling approach; finally, the document's description is built by selecting the descriptors with a probability > 0.)
Each of the subordinate concepts is related to the preferred concept by a relationship
(broader, narrower).
Figure 2. Excerpt of MeSH
As shown in figure 2, the descriptor "Cardiomegaly" consists of two concepts: "Cardiomegaly" and "Cardiac Hypertrophy". Each concept has a preferred term, which is also said to be the name of the concept.
For example, the concept "Cardiomegaly" has three terms: "Cardiomegaly" (preferred term), "Enlarged Heart" and "Heart Enlargement". As in the example above, the concept "Cardiac Hypertrophy" is narrower than the preferred concept "Cardiomegaly".
4.2. Pretreatment
The first step is to split the text into a set of sentences. We use the Tokeniser module of GATE [14] to split the document into tokens such as numbers, punctuation, characters and words. Then, TreeTagger [30] processes these tokens to assign a grammatical category (noun, verb, ...) and a lemma to each token. Finally, our system prunes the stop words of each medical article of the corpus. This process is also carried out on the MeSH thesaurus. Thus, the output of this stage consists of two sets: the article's lemmas, and the list of lemmas existing in the MeSH thesaurus.
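As a minimal illustration of this stage, the following Python sketch tokenizes a text, lemmatizes it and prunes stop words. The lemma table and stop-word list are toy stand-ins for the actual GATE Tokeniser and TreeTagger output, not part of the BIOINSY implementation.

```python
# Toy sketch of the pretreatment stage: tokenize, lemmatize, prune stop words.
# STOP_WORDS and LEMMAS are illustrative stand-ins for the real NLP tools.
import re

STOP_WORDS = {"the", "of", "a", "an", "in", "is", "are"}
LEMMAS = {"diseases": "disease", "enlarged": "enlarge", "hearts": "heart"}

def pretreat(text):
    """Split into word tokens, map each token to its lemma, drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    lemmas = [LEMMAS.get(t, t) for t in tokens]
    return [t for t in lemmas if t not in STOP_WORDS]

# e.g. pretreat("The enlarged heart") yields the lemma list without "the"
```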
Figure 3 outlines the basic steps of the pre-treatment phase.
Figure 3. Pretreatment step
4.3. Term extraction
Automatic term extraction in textual corpora is a challenging task because terms are generally
composed of multiple words. The current term extraction approaches can be categorized as (1)
linguistic [15] [16], (2) statistical [17] [18] and (3) hybrid (linguistic and statistical) [19]
approaches.
The linguistic approaches define, identify and recognise terms by looking at purely linguistic properties, using linguistic filtering techniques that aim to identify specific syntactic term patterns [20].
The statistical approaches are based on quantitative techniques such as collocation measures. A collocation, as defined by Choueka [21], is a sequence of adjacent words that frequently appear together.
The hybrid approaches take into account both linguistic and statistical hints to recognise terms.
As mentioned above, a term can be either simple or composed. To extract the simple terms, we project the MeSH thesaurus onto the document by applying a simple matching: each lemmatized term in the document is matched against the canonical form (lemma) of MeSH terms. To recognize the composed terms, we have chosen to use YaTeA [22]. YaTeA (Yet Another Term ExtrActor) is a hybrid term extractor developed in the ALVIS project. After text processing, YaTeA generates a file composed of two columns: the inflected form of the term and its frequency. For instance, as shown in figure 4, which describes the result of the term extraction process using YaTeA, the term "exercice physique" occurs 6 times.
Figure 4. An excerpt of the result of YaTeA
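The simple-term matching described above can be sketched as follows; the MeSH lemma set is a toy stand-in for the real thesaurus, used only to show the projection by exact lemma matching.

```python
# Sketch of simple-term extraction: project MeSH lemmas onto the document
# lemmas by exact matching. MESH_LEMMAS is a toy stand-in for the thesaurus.
MESH_LEMMAS = {"cancer", "fever", "pregnancy"}

def extract_simple_terms(doc_lemmas):
    """Return each document lemma that matches a MeSH lemma, with its frequency."""
    freq = {}
    for lemma in doc_lemmas:
        if lemma in MESH_LEMMAS:
            freq[lemma] = freq.get(lemma, 0) + 1
    return freq
```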
4.4. Term weighting
Calculating term importance is a significant and fundamental step in the information retrieval process; it is traditionally determined through term frequency (tf) and inverse document frequency (idf).
In this paper, we present a novel weighting technique that exploits the conceptual information in the MeSH thesaurus. To determine term importance, this technique uses the term's relationships in MeSH, rather than relying only on a statistical measure: the term weight is calculated using the occurrences of its synonyms. Term significance is also captured through its location; more precisely, we assign an additional weight to the terms extracted from the title or the abstract of the document. To do so, we use two measures: the Content Structure Weight (CSW) and the Semantic Weight (SW).
4.4.1. Content Structure Weight
Frequency alone is not the main criterion in calculating the CSW of a term: the CSW takes into account the term frequency in each part of the document rather than in the whole document. For example, a term in the Title receives a higher importance (×10) than a term that appears in the Paragraphs (×2). Table 1 shows the coefficients used to weight the term locations; these coefficients were determined experimentally in [23].
Table 1. Weighting coefficients

Term location    Weight of the location
Title (T)        10
Keywords (K)     9
Abstract (A)     8
Paragraphs (P)   2
The CSW of the term ti in a document d is given as follows:

CSW(t_i, d) = \frac{\sum_{A \in \{T,K,A,P\}} W_A \times f(t_i, d, A)}{\sum_{A \in \{T,K,A,P\}} f(t_i, d, A)}   (1)

Where:
– W_A is the weight of the location A (see Table 1),
– f(t_i, d, A) is the occurrence frequency of the term t_i in the document d at location A.
For example, the term cancer occurs in the document d1683 1 time in the title, 2 times in the abstract and 9 times in the paragraphs:

CSW(cancer, d_{1683}) = \frac{1 \times 10 + 2 \times 8 + 9 \times 2}{1 + 2 + 9}
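The CSW computation can be sketched in a few lines of Python; the coefficients follow Table 1, and the worked "cancer" example serves as a check. This is a minimal sketch, not part of the BIOINSY implementation.

```python
# Sketch of the Content Structure Weight (equation 1).
# The location coefficients come from Table 1.
WEIGHTS = {"T": 10, "K": 9, "A": 8, "P": 2}  # Title, Keywords, Abstract, Paragraphs

def csw(freq_by_location):
    """freq_by_location maps a location code to f(ti, d, A)."""
    num = sum(WEIGHTS[a] * f for a, f in freq_by_location.items())
    den = sum(freq_by_location.values())
    return num / den

# The "cancer" example: (1*10 + 2*8 + 9*2) / (1 + 2 + 9)
```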
4.4.2. Semantic Weight
The Semantic Weight of a term ti in the document d depends on its synonyms present in the set of Candidate Terms (CT(d)) generated by the term extraction step. To do so, we use the Synof function, which associates with a given term ti its synonyms among the set CT(d).
Formally, the measure SW is defined as follows:

SW(t_i, d) = \sum_{g \in Synof(t_i, CT(d))} f(g, d)   (2)
For a given term ti, we have on the one hand its Content Structure Weight (CSW(ti, d)) and on the other its Semantic Weight (SW(ti, d)); its Local Weight (LW(ti, d)) is determined as follows:

LW(t_i, d) = \frac{CSW(t_i, d) + SW(t_i, d)}{2}   (3)
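The Semantic Weight and Local Weight computations can be sketched as follows; the synonym table is a toy stand-in for the Synof function over MeSH, labeled hypothetical.

```python
# Sketch of the Semantic Weight (equation 2) and Local Weight (equation 3).
# SYNONYMS is a toy stand-in for the Synof function over MeSH.
SYNONYMS = {"cardiomegaly": {"enlarged heart", "heart enlargement"}}

def sw(term, candidate_freqs):
    """Sum the frequencies f(g, d) of the term's synonyms among CT(d)."""
    return sum(f for g, f in candidate_freqs.items()
               if g in SYNONYMS.get(term, set()))

def lw(csw_value, sw_value):
    """Local Weight: the mean of CSW and SW."""
    return (csw_value + sw_value) / 2
```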
By examining equation 3, we can notice that simple and composed terms are weighted in the same way. Despite several works dealing with the weighting of composed terms, there is so far no weighting technique shared by the community [24]. In our approach, we applied the weighting method proposed by [25]: for a term t composed of n words, its frequency in a document depends on the frequency of the term itself and on the frequency of each sub-term. For this purpose, [25] proposes the measure cf, defined as follows:
cf(t, d) = f(t, d) + \sum_{st \in subterms(t)} \frac{length(st)}{length(t)} \times f(st, d)   (4)

Where:
– f(t, d) is the number of occurrences of t in the document d,
– length(t) represents the number of words in the term t,
– subterms(t) is the set of all possible MeSH terms which can be derived from t.
For example, if we consider the term "cancer of blood", knowing that "cancer" is itself also a MeSH term, its frequency is computed as:

cf(cancer\ of\ blood, d) = f(cancer\ of\ blood, d) + \frac{1}{2} \times f(cancer, d)
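The cf measure can be sketched as follows, with sub-terms and their word lengths passed in explicitly (in the paper they are derived from MeSH); the "cancer of blood" example above serves as a check.

```python
# Sketch of the composed-term frequency cf (equation 4). Sub-terms and their
# lengths are supplied explicitly; in the paper they come from MeSH.
def cf(term_freq, term_length, subterm_freqs):
    """subterm_freqs maps each MeSH sub-term to a (length, frequency) pair."""
    return term_freq + sum((length / term_length) * f
                           for length, f in subterm_freqs.values())

# "cancer of blood" occurring twice, sub-term "cancer" occurring 3 times:
# cf = 2 + (1/2) * 3
```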
Consequently, to take into account the case of composed terms, we calculate the CSW measure as follows:

CSW(t_i, d) = \frac{\sum_{A \in \{T,K,A,P\}} W_A \times f(t_i, d, A)}{\sum_{A \in \{T,K,A,P\}} f(t_i, d, A)} + \sum_{st \in subterms(t_i)} \frac{length(st)}{length(t_i)} \times f(st, d)   (5)

Where: f(st, d) is the number of occurrences of st in the document d.
4.5. Selection of descriptors
A MeSH term may be located in different hierarchies at various levels of specificity, which reflects its ambiguity. As an illustration, figure 5 depicts the term "Pain", which belongs to four branches of three different hierarchies (descriptors), whose most generic descriptors are: Nervous System Diseases (C10); Pathological Conditions, Signs and Symptoms (C23); Psychological Phenomena and Processes (F02); Musculoskeletal and Neural Physiological Phenomena (G11).
Figure 5. Term Pain in MeSH
In recent years, due to the number of ambiguous terms and their various senses used in biomedical texts, term ambiguity resolution has become a challenge for several researchers [26][27][28]. Differently from the works proposed in the literature, our method assigns the appropriate descriptor to a given term by using the language modeling approach.
For an ambiguous term, the task of WSD consists in answering the question: among its several senses, which is the best descriptor that can represent this term? The task of the WSD system is then to estimate, for each candidate MeSH descriptor, which is most likely to be the ideal concept. Hence, we estimate a language model for each document in the collection and rank the candidate MeSH descriptors with respect to the likelihood that the document language model generates each of them. Our basic assumption behind descriptor selection is that "for a given term ti having several senses in the document d, the best descriptor is the one which has the highest probability to be generated by the model of this document".
To do so, we estimate P(des|d), the probability of generating the descriptor des according to the document model. This probability is estimated using the language modeling approach proposed by [29].
In the following, after a brief presentation of the language modeling approach of [29], we detail our WSD method.
4.5.1. Language modeling approach
As is the case with the majority of language modeling approaches, [29] assumes that the user has a reasonable idea of the terms that are likely to appear in the ideal document that can satisfy his information need, and that the query terms the user chooses can distinguish the ideal document from the rest of the collection. The query is then generated as the piece of text representative of the ideal document. For each document d in the collection, [29] calculates the probability of generating the user query Q by using this equation:

P(Q|d) = \prod_{t_i \in Q} ((1 - \lambda) \times P(t_i|d) + \lambda \times P(t_i|D))   (7)
Where:
– P(t_i|D) represents the probability of generating t_i from the corpus D:

P(t_i|D) = \frac{df(t_i, D)}{\sum_{t_j \in D} df(t_j, D)}   (8)
df(t_j, D) is the number of documents in which the term t_j occurs.
– P(t_i|d) represents the probability of generating t_i from a document d:

P(t_i|d) = \frac{tf(t_i, d)}{|d|}   (9)

tf(t_i, d) is the frequency of the term t_i in the document d.
Finally, documents are ranked according to their estimated degree or probability of usefulness for
the user.
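The smoothed generation probability of equations 7 to 9 can be sketched as follows, with toy document and corpus statistics in place of a real collection.

```python
# Sketch of the smoothed generation probability of [29] (equations 7-9).
# Toy statistics stand in for a real MEDLINE-scale collection.
def p_term_corpus(term, doc_freqs):
    """Equation 8: df(t, D) normalized by the sum of document frequencies."""
    return doc_freqs.get(term, 0) / sum(doc_freqs.values())

def p_term_doc(term, doc):
    """Equation 9: tf(t, d) over the document length."""
    return doc.count(term) / len(doc)

def p_query(query, doc, doc_freqs, lam=0.5):
    """Equation 7: product of smoothed per-term probabilities."""
    p = 1.0
    for t in query:
        p *= (1 - lam) * p_term_doc(t, doc) + lam * p_term_corpus(t, doc_freqs)
    return p
```

The interpolation parameter lambda trades off the document model against the corpus model, so that query terms absent from a document do not zero out the whole product.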
4.5.2. WSD using language modelling
In our WSD approach, to determine the best descriptor of an ambiguous term, we have adapted the language model of [29] by substituting the query with the MeSH descriptor. Thus, we infer a language model for each document and rank MeSH descriptors according to their probability of being produced given this model. We want to estimate P(des|d), the probability of generating a MeSH descriptor des given the language model of document d.
For a collection D, a document d and a MeSH descriptor des composed of n concepts, the probability P(des|d) is given by:
P(des|d) = \prod_{c_j \in RelatedtoDes(des, d)} ((1 - \lambda) \times P(c_j|d) + \lambda \times P(c_j|D))   (10)

RelatedtoDes (respectively RelatedtoCon) is the function that associates, with a given descriptor des (respectively concept con) and a document d, the MeSH concepts (respectively terms) which are related to des (respectively con) in d.
In equation 10, we need to estimate two probabilities:
• P(c_j|D): the probability of observing the concept c_j in the collection D:

P(c_j|D) = \frac{f(c_j, D)}{\sum_{c' \in D} f(c', D)}

with f(c_j, D) = \sum_{t_l \in RelatedtoCon(c_j, d)} df(t_l, D)

df(t, D) (document frequency) is the number of documents of D in which the term t occurs.
• P(c_j|d): the probability of observing the concept c_j in a document d:

P(c_j|d) = \frac{f(c_j, d)}{concepts(d)}

with f(c_j, d) = \sum_{t_l \in RelatedtoCon(c_j, d)} LW(t_l, d)

LW(t, d) is determined by using equation 3.
Based on this approach, to assign the appropriate sense (Best Descriptor, BD) to an ambiguous term ti in the context of a document dj, we go through these steps:
1. Compute the descriptor relevance score:
Let senses_{t_i}^{d_j} = \{des_1, des_2, des_3, \ldots, des_n\} be the set of MeSH descriptors that can represent the term ti in the document dj. For each descriptor desk in this set, we need to measure its ability to represent the term ti in the document dj. To do so, we calculate P(desk|dj) (see equation 10).
2. Selection of the best descriptor:
The best descriptor (BD) to retain is the one which maximizes P(desk|dj):

BD(t_i, d_j) = \arg\max_{des_k \in senses_{t_i}^{d_j}} P(des_k \mid d_j)   (13)
Finally, the Semantic Index (SI) of the document dj is generated as follows:

SI(d_j) = \bigcup_{t_i \in CT(d_j)} BD(t_i, d_j)   (14)
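The disambiguation step of equations 10 and 13 can be sketched as follows; the per-concept probability tables are toy values, not real MeSH estimates.

```python
# Sketch of descriptor disambiguation (equations 10 and 13): score each
# candidate descriptor over its related concepts, then keep the argmax.
# p_doc and p_corpus are toy probability tables, not real MeSH estimates.
def p_descriptor(concepts, p_doc, p_corpus, lam=0.5):
    """Equation 10: product of smoothed concept probabilities."""
    p = 1.0
    for c in concepts:
        p *= (1 - lam) * p_doc.get(c, 0.0) + lam * p_corpus.get(c, 0.0)
    return p

def best_descriptor(senses, p_doc, p_corpus):
    """Equation 13: the descriptor maximizing P(des | d)."""
    return max(senses, key=lambda des: p_descriptor(senses[des], p_doc, p_corpus))
```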
5. EXPERIMENTAL EVALUATION
In order to assess the feasibility of our indexing approach, we have implemented a BIOmedical Indexing SYstem (BIOINSY). Our indexing system was evaluated through intensive experiments. To make these experiments clear, we first briefly present the training data sets. After that,
we describe the experimental process and the techniques used for validation. Finally, we discuss
the obtained results.
5.1. Training data sets
To evaluate our indexing approach, we built a corpus of 500 scientific articles selected from CISMEF⁶. In this collection, we tried to have a base diversified in terms of content. The corpus was built using a set of medical keywords (cancer, diabetes, pregnancy, fever, ...) submitted to the search engine of CISMEF. Analysis of this corpus revealed about 716,000 words. Each document of this corpus has been manually indexed by five professional indexers of the CISMEF team in order to provide a description called a "notice".
Figure 6 shows an example of notice for resource “diabete de type 2”.
A notice is mainly composed of:
– General description: title, authors, abstract…
– Specific description: MeSH descriptors that describe article content.
In this evaluation, the notice (manual indexing) is used as a reference: performance evaluation was done over the same set of 500 articles, by comparing the set of MeSH descriptors retrieved by our system against the manual indexing provided by the professional indexers.
Figure 6. CISMeF notice for resource “Diabete de type 2”
5.2. Experimental process
In this experimental process, three measures have been used: precision, recall and F-measure.
Precision corresponds to the number of indexing concepts correctly retrieved over the total number of retrieved concepts.
Recall corresponds to the number of indexing concepts correctly retrieved over the total number of expected concepts.
The F-measure combines precision and recall into a single measure.
⁶ Catalogue et Index des Sites Médicaux Francophones
Recall = \frac{TP}{TP + FN}

Precision = \frac{TP}{TP + FP}

F\text{-}measure = \frac{1}{\alpha \times \frac{1}{Precision} + (1 - \alpha) \times \frac{1}{Recall}}
Where:
• TP (true positive) is the number of MeSH concepts correctly identified by the system, i.e. also found in the manual indexing.
• FN (false negative) is the number of MeSH concepts that the system failed to retrieve.
• FP (false positive) is the number of MeSH concepts retrieved by the system but not found in the manual indexing.
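These measures can be computed directly from the system's descriptor set and the manual "notice" set, as in the following sketch (which assumes non-empty sets and at least one true positive, to avoid division by zero).

```python
# Sketch of the evaluation measures: compare the retrieved descriptor set
# against the manual "notice" set. Assumes tp > 0 to avoid division by zero.
def evaluate(retrieved, manual, alpha=0.5):
    tp = len(retrieved & manual)       # correctly retrieved concepts
    precision = tp / len(retrieved)    # TP / (TP + FP)
    recall = tp / len(manual)          # TP / (TP + FN)
    f = 1 / (alpha * (1 / precision) + (1 - alpha) * (1 / recall))
    return precision, recall, f
```

With alpha = 0.5, the F-measure reduces to the usual harmonic mean of precision and recall.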
5.3. Experimental results
To evaluate the effectiveness of our weighting method, we carried out three sets of experiments:
• Experiment 1: classical term weighting: terms are weighted using the tf/idf measures.
• Experiment 2: semantic term weighting: terms are weighted using the CSW and SW measures (see equations 1 and 2).
• Experiment 3: composed term weighting: the CSW measure is calculated using equation 5.
Table 2 shows the precision (P) and the recall (R) obtained by our system BIOINSY at fixed
ranks 1 through 10 in each case cited above.
Table 2. Precision and recall generated by BIOINSY
Rank   Experiment 1 (P/R)   Experiment 2 (P/R)   Experiment 3 (P/R)
1      17.96/20.67          19.87/22.04          21.25/24.96
4      31.19/18.81          51.98/20.13          45.15/20.13
10     48.96/9.14           69.85/8.29           68.72/9.08
To make these experimental results clearer, figure 7 presents the F-measure values generated by BIOINSY at ranks 1, 4 and 10.
Figure 7. The F-measure value generated by BIOINSY at ranks 1, 4 and 10 in three sets of experiments
As shown in figure 7, at all ranks, our semantic-based weighting approach (Experiment 2) clearly outperforms the classical term weighting method (Experiment 1). For instance, at rank 4, the precision displayed by BIOINSY is 31.19% in Experiment 1 and 51.98% in Experiment 2. The obtained results also confirm the interest of integrating the composed-term technique in the weighting process: in most cases (ranks 1 and 10), Experiment 3 shows the best F-measure value. These experimental results highlight the performance of the weighting method proposed in this paper.
6. COMPARISON OF BIOINSY WITH OTHER TOOLS
To prove the effectiveness of our indexing method, we compared BIOINSY to other medical indexing systems. We evaluated the performance of five indexing systems (MetaMap, EAGL, KNN, MTI and BIOINSY) in terms of reproducing the manual MeSH annotations.
For this evaluation, we used the same corpus⁷ as [33], composed of 1000 random MEDLINE citations.
Table 3 shows the results generated by the indexing systems using the titles of the 1000 random MEDLINE citations.
Table 3. Precision and MAP (Mean Average Precision) generated by the indexing systems using the title as input

Indexing system   Precision at rank 10 (P@10)   MAP
MTI               0.18                          0.16
MetaMap           0.17                          0.14
EAGL              0.18                          0.17
KNN               0.43                          0.47
BIOINSY           0.39                          0.43
⁷ The corpus can be downloaded at http://www.ebi.ac.uk/∼triesch/meshup/testset_v1.xml
Table 4 shows the results generated by the indexing systems with the titles and abstracts of the 1000 random MEDLINE citations.
Table 4. Precision and MAP (Mean Average Precision) generated by the indexing systems using title and abstract as input
Indexing system   Precision at rank 10 (P@10)   MAP
MTI               0.32                          0.25
MetaMap           0.19                          0.16
EAGL              0.21                          0.19
KNN               0.45                          0.50
BIOINSY           0.30                          0.26
Figure 8 illustrates the results obtained by the five indexing systems on the 1000 random MEDLINE citations.
Figure 8. Experimental results generated by the five indexing systems
Our system BIOINSY serves as the baseline against which the other systems are compared.
Both MetaMap and EAGL perform worse than BIOINSY on all metrics; MetaMap performs similarly to or slightly worse than EAGL, whether presented with the title only or with both title and abstract of the citation to index.
MTI performs worse than BIOINSY when only the title is available for indexing: with the title as input, MTI obtains a P@10 of 0.18, while BIOINSY reaches 0.39. When using title and abstract, MTI performs similarly to or slightly better than BIOINSY in terms of MAP and P@10.
With the title as input, KNN and BIOINSY show very similar performance; given the title and abstract of a citation, KNN shows the best results on all metrics.
The obtained results confirm the interest of using the language modeling approach in the conceptual indexing process.
7. CONCEPTUAL RETRIEVER
In this section, we try to answer the following question: can our conceptual indexing approach (described and evaluated above) improve the information retrieval process? This section is organized as follows: subsection 7.1 presents the test collection, subsection 7.2 describes the experimental setup, and subsection 7.3 analyses and discusses the experimental results.
7.1. Test collection
To evaluate the retrieval effectiveness of our conceptual indexing method, we use the
ImageCLEFmed 2007 collection8. Started in 2004, ImageCLEFmed (the medical retrieval task) aims
at evaluating the performance of medical information systems that retrieve medical information
from a mono- or multilingual image collection.
This corpus [31] is based on a dataset containing images from the Casimage, MIR, PEIR,
PathoPIC, CORI, myPACS and Endoscopic collections. Each image of this corpus is given a textual
description called a diagnosis. An overview of the databases used in ImageCLEFmed 2007 is shown
in Table 5.
Table 5. Details of the ImageCLEFmed 2007 collection

Collection Name   Cases   Images   Annotations   Annotations by Language
Casimage           2076     8725          2076   French – 1899, English – 177
MIR                 407     1177           407   English – 407
PEIR              32319    32319         32319   English – 32319
PathoPIC           7805     7805         15610   German – 7805, English – 7805
myPACS             3577    15140          3577   English – 3577
Endoscopic         1496     1496          1496   English – 1496
Total             47680    66662         55485   French – 1899, English – 45781, German – 7805
7.2. Experimental setup
The ImageCLEF data contains a qrels file (in TREC format) which specifies the set of images
relevant to a given query. Our indexing method deals with the textual documents. Hence, to
evaluate our approach, we assume that “if a query is relevant to an image, then it is also
relevant to the image's textual description (diagnosis)”.
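This assumption can be sketched as a small qrels-propagation step. The TREC qrels layout used below (query id, iteration, document id, relevance) is standard, but the image-to-diagnosis mapping and the identifiers are hypothetical.

```python
# Sketch of propagating image-level relevance judgements (TREC qrels) to the
# textual descriptions (diagnoses), as assumed in the evaluation above.

def propagate_qrels(image_qrels_lines, image_to_diagnosis):
    """Turn image-level qrels lines into (query, diagnosis, relevance) triples."""
    text_qrels = []
    for line in image_qrels_lines:
        query_id, _, image_id, relevance = line.split()
        if image_id in image_to_diagnosis:
            text_qrels.append((query_id, image_to_diagnosis[image_id], int(relevance)))
    return text_qrels

# Toy example with hypothetical identifiers:
lines = ["1 0 img_0042 1", "1 0 img_0099 0"]
mapping = {"img_0042": "diag_0042", "img_0099": "diag_0099"}
print(propagate_qrels(lines, mapping))
# [('1', 'diag_0042', 1), ('1', 'diag_0099', 0)]
```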
This evaluation process is structured around the following steps:
8 CLEF: Cross Language Evaluation Forum
• Indexing of diagnoses and queries:
The indexing process is carried out on the diagnoses and the queries. Thus, the documents, and
possibly the queries, are expanded with the descriptors identified by our conceptual indexing
method.
The documents and the queries are represented as follows:
dj = (w1j, w2j, ..., wnj)
• Calculation of Retrieval Status Value (RSV (q, d)):
The relevance score of the document dj with respect to the query q is given by:
RSV(q, dj) = Σ des∈Q TFj(des) × IDF(des)
Where:
- TFj: the normalized term frequency of the current descriptor in document dj.
- IDF: the normalized inverse document frequency of the current descriptor in the
collection.
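A minimal sketch of this scoring rule follows, assuming a relative-frequency TF and a smoothed logarithmic IDF; the exact normalizations used by BIOINSY are not specified here, so these are common illustrative choices rather than the system's actual ones.

```python
# Illustrative implementation of the retrieval score defined above:
# RSV(q, dj) = sum over descriptors des in query Q of TFj(des) * IDF(des).
import math

def rsv(query_descriptors, doc_descriptors, collection):
    """collection: list of documents, each a list of descriptors."""
    n_docs = len(collection)
    score = 0.0
    for des in query_descriptors:
        tf = doc_descriptors.count(des) / len(doc_descriptors)  # normalized TF
        df = sum(1 for d in collection if des in d)             # document frequency
        idf = math.log((n_docs + 1) / (df + 1))                 # smoothed IDF
        score += tf * idf
    return score

# Toy collection of three "diagnoses" expanded with invented MeSH descriptors:
docs = [
    ["Liver", "Neoplasms", "Liver"],
    ["Humans", "Adult"],
    ["Liver", "Humans"],
]
print(rsv(["Liver", "Neoplasms"], docs[0], docs))
```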
7.3. Results and discussion
In order to assess the validity of our method, we compared the results of our indexing system
BIOINSY to the official runs of the 2007 medical retrieval task.
Table 6 summarizes the results obtained by the participants in medical retrieval task 2007.
Table 6. Comparison of our system with the official runs in ImageCLEFmed 2007

Run                         P@5    MAP
LIG-MRIM-LIGMU (the best)   0.44   0.32
OHSU (the second)           0.42   0.27
IPAL4 (median)              0.39   0.27
miracleTxtFRT (median)      0.43   0.17
IRIT RunMed1 (the worst)    0.05   0.04
Our system                  0.38   0.24
Examining Table 6, we note that the results generated by our system are close to those of the
best run (LIG-MRIM-LIGMU). We therefore conclude that the conceptual indexing approach proposed
in this paper can significantly improve biomedical IR performance.
8. CONCLUSION
The work developed in this paper outlined a concept language model using the MeSH thesaurus
for representing the semantic content of medical articles.
Our proposed conceptual indexing approach consists of three main steps. In the first step (term
extraction), given an article, the MeSH thesaurus and the NLP tools, the system BIOINSY extracts
two sets: the article's lemmas, and the list of those lemmas that also exist in the MeSH
thesaurus. These sets are then used to extract the MeSH terms present in the document. In step 2,
the extracted terms are weighted using the CSW and SW measures, which interpret MeSH conceptual
information to compute term importance. Step 3 recognizes the MeSH descriptors that represent
the document by using the language model.
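The three steps above can be sketched end-to-end on a toy example. The tiny MeSH fragment, the frequency-based weighting (standing in for the paper's CSW and SW measures) and the additive descriptor scoring are illustrative assumptions, not the actual BIOINSY implementation.

```python
# Highly simplified sketch of the three-step indexing pipeline summarized above.

# A toy fragment of MeSH, mapping entry terms to descriptors (invented subset).
MESH = {"liver": "Liver", "hepatic": "Liver", "tumor": "Neoplasms"}

def extract_terms(text):
    """Step 1: keep the article's lemmas that also exist in MeSH."""
    lemmas = text.lower().split()
    return [t for t in lemmas if t in MESH]

def weight_terms(terms):
    """Step 2: weight each extracted term (here: plain relative frequency)."""
    return {t: terms.count(t) / len(terms) for t in set(terms)}

def select_descriptors(weights, top_n=2):
    """Step 3: score each descriptor by its terms' weights and rank."""
    scores = {}
    for term, w in weights.items():
        des = MESH[term]
        scores[des] = scores.get(des, 0.0) + w
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

terms = extract_terms("Hepatic tumor observed in the liver biopsy")
print(select_descriptors(weight_terms(terms)))  # ['Liver', 'Neoplasms']
```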
In order to assess its feasibility, our indexing approach was evaluated on data sets containing
500 medical articles. An experimental evaluation and comparison of BIOINSY with other indexing
tools confirms the benefit of using the language modelling approach in the conceptual indexing
process. The performance of our approach is also close to the average of the official runs in the
CLEF medical retrieval task 2007 and is comparable to the best run.
Our future work aims at incorporating a form of semantic smoothing into the language modeling
approach. We also plan to use several semantic resources in the indexing process. We believe that
a multi-terminology based indexing approach can enhance IR performance.
REFERENCES
[1] Pouliquen, B.: Indexation de textes médicaux par indexation de concepts, et ses utilisations. PhD
thesis, Université de Rennes 1 (2002)
[2] Neveol, A.: Automatisation des tâches documentaires dans un catalogue de santé en ligne. PhD thesis,
Institut National des Sciences Appliquées de Rouen (2005)
[3] Dinh, D., Tamine, L.: Biomedical concept extraction based on combining the content-based and word
order similarities. In: SAC. (2011) 1159–1163
[4] A.Aronson, J.Mork, C.S., W.Rogers: The nlm indexing initiative’s medical text indexer. In: Medinfo.
(2004)
[5] A.Aronson: Effective mapping of biomedical text to the umls metathesaurus: the metamap program.
In: AMIA. (2001) 17–21
[6] W.Kim, A.Aronson,W.: Automatic mesh term assignment and quality assessment. In: AMIA. (2001)
[7] P.Lenoir, R.Michel, C., G.Chales: Réalisation, développement et maintenance de la base de données
a.d.m. In: Médecine informatique. (1981)
[8] Zakos, J., Verma, B.: Concept-based term weighting for web information retrieval. Computational
Intelligence and Multimedia Applications, International Conference on 0 (2005) 173–178
[9] Lesk, M.: Automatic sense disambiguation using machine readable dictionaries: how to tell a pine
cone from an ice cream cone. In: SIGDOC ’86: Proceedings of the 5th annual international conference
on Systems documentation, New York, NY, USA, ACM (1986) 24–26
[10] Mihalcea, R.: Unsupervised large-vocabulary word sense disambiguation with graph-based
algorithms for sequence data labeling. In: Proceedings of the conference on Human Language
Technology and Empirical Methods in Natural Language Processing. HLT ’05, Stroudsburg, PA,
USA, Association for Computational Linguistics (2005) 411–418
[11] Lee, Y.K., Ng, H.T., Chia, T.K.: Supervised word sense disambiguation with support vector
machines and multiple knowledge sources. In: Senseval-3: Third International Workshop on the
Evaluation of Systems for the Semantic Analysis of Text. (2004) 137–140
[12] Liu, H., Teller, V., Friedman, C.: A multi-aspect comparison study of supervised word sense
disambiguation. J Am Med Inform Assoc (11) 320–31
[13] Yarowsky, D.: Unsupervised word sense disambiguation rivalling supervised methods. In: In
Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics. (1995)
189–196
[14] Cunningham, M., Maynard, D., Bontcheva, K., V.Tablan: Gate: A framework and graphical
development environment for robust NLP tools and applications. ACL (2002)
[15] Bourigault, D.: Extraction et structuration automatique de terminologie pour l’aide à l’acquisition
de connaissances à partir de textes. In: 9ème congrès Reconnaissance des Formes et Intelligence
Artificielle (RFIA’94), Paris. (1994) 397–408
[16] Jacquemin, C., Royaut, J.: Retrieving terms and their variants in a lexicalized unification-based
framework. In: Proceedings of the 17th annual international ACM SIGIR conference on Research and
development in information retrieval. (1994) 132–141
[17] Drouin, P., Ladouceur, J.: (L’identification automatique de descripteurs complexes dans des
textes de spécialité)
[18] Oueslati, R.: Aide à l’acquisition de connaissances à partir de corpus. PhD thesis, Université Louis
Pasteur (1999)
[19] Daille, B.: Approche mixte pour l’extraction automatique de terminologie : statistiques lexicales et
filtres linguistiques. PhD thesis, Université Paris 7 (1994)
[20] Pazienza, M., Pennacchiotti, M., Zanzotto, F.: Terminology extraction: An analysis of linguistic and
statistical approaches. In Sirmakessis, S., ed.: Knowledge Mining Series: Studies in Fuzziness and
Soft Computing. Springer Verlag (2005)
[21] Choueka, Y.: (Looking for needles in a haystack or locating interesting collocational expressions in
a large textual database)
[22] Aubin, S., Hamon, T.: Improving term extraction with terminological resources. In: Advances in
Natural Language Processing. Volume 4139 of Lecture Notes in Computer Science. Springer Berlin /
Heidelberg (2006) 380–387
[23] Gamet, J.: Indexation de pages web. DEA report, Université de Nantes (1998)
[24] Baziz, M., Boughanem, M., Aussenac-Gilles, N., Chrisment, C.: Semantic cores for representing
documents in ir. In: Proceedings of the 2005 ACM symposium on Applied computing. SAC ’05,
ACM (2005) 1011–1017
[25] Baziz, M.: Indexation conceptuelle guidée par ontologie pour la recherche d’information. PhD thesis,
Université Paul Sabatier (2006)
[26] Andreopoulos, B., Alexopoulou, D., Schroeder, M.: Word sense disambiguation in biomedical
ontologies with term cooccurrence analysis and document clustering. IJDMB 2 (2008) 193–215
[27] Stevenson, M., Guo, Y., Gaizauskas, R., Martinez, D.: Knowledge sources for word sense
disambiguation of biomedical text. In: BioNLP ’08: Proceedings of theWorkshop on Current Trends
in Biomedical Natural Language Processing, Association for Computational Linguistics (2008) 80–87
[28] Dinh, B., Tamine, L.: Sense-based biomedical indexing and retrieval. In: NLDB. (2011) 24–35
[29] Hiemstra, D.: Using Language Models for Information Retrieval. PhD thesis, University of Twente
(2001)
[30] H.Schmid: Probabilistic part-of-speech tagging using decision trees. In: International Conference on
New Methods in Language Processing, Manchester (1994)
[31] Müller, H., Deselaers, T., Deserno, T., Clough, P., Kim, E., Hersh, W.: Overview of the ImageCLEFmed
2006 medical retrieval and annotation tasks. In: In: CLEF 2006 Proceedings. Lecture Notes in
Computer Science (2006) 595–608
[32] Ruch, P. Automatic assignment of biomedical categories: toward a generic approach. Bioinformatics,
22, (2006) 658–664.
[33] Trieschnigg D., Pezik P., Lee V., Kraaij W., de Jong F., and Rebholz-Schuhmann D. MeSH Up:
Effective MeSH Text Classification and Improved Document Retrieval. Bioinformatics 25(11),
(2009), 1412–1418.