This document discusses how bioinformatics research can benefit from techniques in information retrieval. It provides background on bioinformatics, information retrieval, and how the fields intersect. Specifically, it describes how indexing, searching, filtering, mining and categorizing large amounts of bioinformatics data and publications can help with tasks like acquiring, analyzing, organizing and storing biological information. The document also presents several case studies of specific tools and systems that apply IR techniques in bioinformatics.
Bioinformatics Meets Information Retrieval
1. Bioinformatics Meets Information Retrieval
State of the Art and a Case Study
Eloisa Vargiu
Intelligent Agents and Soft-Computing Group
Dept. of Electrical and Electronic Engineering
University of Cagliari, Italy
February 16, 2011 - Valencia (Spain)
email: vargiu@diee.unica.it
2. My Background
• 2000 - 2004
  • Automatic planning
    • Classic domains: HW[]
    • Dynamic domains: HIPE
• 2004 - 2009
  • Bioinformatics
    • Protein secondary structure prediction: MASSP3 and GAME/SSP
• 2000 - ...
  • Multiagent systems
    • A Personalized Adaptive and Cooperative Multiagent System: PACMAS
    • A generic architecture to perform information retrieval tasks: X.MAS
• 2006 - ...
  • Information Retrieval
    • Hierarchical text categorization: PF and TSA
    • Recommender systems and contextual advertising: ConCA
3. Outline
• Context and Mission
• Why Bioinformatics Needs Information Retrieval
• Bioinformatics Meets Information Retrieval
• Case Study: Retrieving and Filtering Bioinformatics Publications
• Conclusions
5. Web Evolution
• Web 1.0 (1993)
  • Source of information
  • Personal homepages
• Web 2.0 (2004)
  • Social networks
  • (Micro)Blogging
• Web 3.0 (2005)
  • Semantic web
  • Web composition
6. Web Evolution and Bioinformatics
• A long time ago...
  • Data was stored in local DBs
  • Data was shared as flat files
  • Biologists worked alone or in small groups
7. Web Evolution and Bioinformatics
• Today...
  • Online repositories
    • The major sources of nucleotide sequences are those belonging to the International Nucleotide Sequence Database Collaboration
      • DDBJ (DNA DataBank of Japan)
      • EMBL (European Molecular Biology Laboratory)
      • GenBank (NIH genetic sequence database)
  • Web services
    • Basic bioinformatics services are classified by the EBI into three categories
      • SSS (Sequence Search Services)
      • MSA (Multiple Sequence Alignment)
      • BSA (Biological Sequence Analysis)
8. Web Evolution and Scientific Publications
• A long time ago...
  • Publications were consulted at the library
  • Just two or three relevant available journals
  • Manual selection of relevant publications
9. Web Evolution and Scientific Publications
• Today...
  • Online journals
  • Online conference proceedings
  • Publications are often available for free
  • Manual selection of relevant publications becomes unfeasible
10. As a Consequence...
• Unstructured information
• Information overload
• Personalized information selection and input imbalance
11. Our Mission
• To cope with
  • Unstructured information, classifying documents according to a given taxonomy
  • Information overload, filtering information to reduce redundancy
  • Personalized information selection and input imbalance, filtering information according to user preferences
• Case study
  • Retrieving and filtering bioinformatics publications
12. Research Topics
• Information Retrieval
• Bioinformatics
13. Information Retrieval
Information Retrieval (IR) deals with the representation, storage, organization of, and access to information items.
The user must first translate this information need into a query which can be processed by an IR system.
Given the user query, the key goal of an IR system is to retrieve information which might be useful or relevant to the user.
R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. New York: Addison-Wesley, 1999.
14. Main IR Topics
• Indexing
• Search and Web Search
• Information Filtering
• Text Mining
• Text Categorization and Hierarchical Text Categorization
15. Bioinformatics
Bioinformatics is the field of science in which biology, computer science, and information technology merge to form a single discipline.
The ultimate goal of the field is to enable the discovery of new biological insights as well as to create a global perspective from which unifying principles in biology can be discerned.
National Center for Biotechnology Information (NCBI), http://www.ncbi.nlm.nih.gov/.
16. Main Bioinformatics Research Areas
• Sequence analysis
• Genome annotation
• Computational evolutionary biology
• Analysis of gene expression
• Analysis of protein expression
• Analysis of mutations in cancer
• Comparative genomics
• Modelling biological systems
• Prediction of protein structure
• Molecular interaction
17. Why Bioinformatics Needs Information Retrieval
18. Does Bioinformatics Need IR?
• Bioinformatics is concerned with researching, developing and applying tools and methods to acquire, analyse, organize and store biological and medical data
  • Indexing and search techniques may help in the task of acquiring data
  • Information filtering, text mining and text categorization techniques may be useful for the analysis of data
  • Text categorization, with particular reference to hierarchical text categorization, may be used in the organization and storage tasks
19. Bioinformatics Data
• A huge amount of data to be
  • Indexed
  • Searched for in large databases or on the web
  • Filtered according to users' preferences
  • Text mined
  • Categorized according to its textual content
20. DB Indexing
• Why
  • Data types are relegated to blob and unstructured text fields
  • Few results in building persistent access paths to support fast retrieval methods
  • Genomic datasets in public repositories are annotated with free-text fields describing the pathological state of the studied sample
  • Annotations are not mapped to concepts in any ontology
21. DB Indexing
• Who
  • MoBIoS - Molecular Biological Information System
• What
  • A specialized database management system
  • The storage manager is based on metric-space indexing
  • Query language entails biological data types
• Where
  • Sequence homology: local alignment and mutations
D. Miranker, W. Xu, and R. Mao. MoBIoS: a Metric-Space DBMS to Support Biological Discovery. Proceedings of the International Conference on Scientific and Statistical Database Management Systems, 2003.
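The idea of metric-space indexing can be illustrated with a small sketch. It is not MoBIoS code: the `PivotIndex` class, the choice of pivots, and the use of Hamming distance on equal-length strings are all assumptions made for illustration. The only property it relies on is that the distance is a metric, so the triangle inequality lets the index discard candidates without computing their distance to the query.

```python
from typing import Callable, List, Sequence


def hamming(a: str, b: str) -> int:
    """Hamming distance between two equal-length sequences (a true metric)."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


class PivotIndex:
    """Toy metric-space index (hypothetical): stores item-to-pivot distances."""

    def __init__(self, items: Sequence[str], pivots: Sequence[str],
                 dist: Callable[[str, str], int] = hamming) -> None:
        self.items = list(items)
        self.pivots = list(pivots)
        self.dist = dist
        # Precomputed once at build time: distance from every item to every pivot.
        self.table = [[dist(x, p) for p in self.pivots] for x in self.items]
        self.evaluations = 0  # query-to-item distances actually computed

    def range_query(self, query: str, radius: int) -> List[str]:
        """Return all items within `radius` of `query`."""
        q_to_pivots = [self.dist(query, p) for p in self.pivots]
        hits = []
        for item, row in zip(self.items, self.table):
            # Triangle inequality: |d(q,p) - d(x,p)| <= d(q,x).
            # If the left side already exceeds the radius, x cannot match.
            if any(abs(qp - xp) > radius for qp, xp in zip(q_to_pivots, row)):
                continue
            self.evaluations += 1
            if self.dist(query, item) <= radius:
                hits.append(item)
        return hits
```

With four stored sequences and two pivots, a query of radius 1 prunes the two distant sequences through the pivot table alone and computes only two direct distances.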
22. DB Indexing
• Who
  • --
• What
  • Ontology-driven indexing of public datasets for translational bioinformatics
  • Methods to map text annotations of gene expression datasets to concepts in the UMLS
• Where
  • Gene Expression Omnibus
  • Stanford Tissue Microarray Database
N.H. Shah, C. Jonquet, A.P. Chiang, A.J. Butte, R. Chen, and M.A. Musen. Ontology-driven indexing of public datasets for translational bioinformatics. BMC Bioinformatics, 10(Suppl 2):S1, 2009.
23. Web Indexing
• Why
  • Most often sequence retrieval tools and sequence analysis tools are separated
  • The usage of sequence DBs is often general and limited to keyword searching and entry retrieval
  • Discovering and accessing the appropriate bioinformatics resource for a specific task has become increasingly important
24. Web Indexing
• Who
  • SIRW - A Web Server for Simple Indexing and Retrieval System
• What
  • A WWW interface to the Simple Indexing and Retrieval (SIR) system to parse and index flat file DBs
  • A framework for doing sequence analysis for selected biological sequences
• Where
  • Sequence analysis: motif pattern searches
C. Ramu. SIRW: a web server for the Simple Indexing and Retrieval System that combines sequence motif searches with keyword searches. Nucleic Acids Research, 31(13), pp. 3771-3774, 2003.
25. Web Indexing
• Who
  • BIRI - BIoinformatics Resource Inventory
• What
  • An approach for automatically discovering and indexing public bioinformatics resources
• Where
  • The scientific literature
G. de la Calle, M. García-Remesal, S. Chiesa, D. de la Iglesia, V. Maojo. BIRI: a new approach for automatically discovering and indexing available public bioinformatics resources from the literature. BMC Bioinformatics, Oct 7;10:320, 2009.
26. DB Search
• Why
  • A wealth of bioinformatics tools and databases has been created over the last decade and most are freely available
  • Often it is desired to visualize the database hits stacked according to the query sequence
  • There is no inventory presenting an up-to-date and easily searchable index of all these resources
27. DB Search
• Who
  • MView - Multiple alignment Viewer
• What
  • A tool for converting the result of a sequence database search into the form of a coloured multiple alignment of hits stacked against the query
• Where
  • Multiple alignment
N.P. Brown, C. Leroy, and C. Sander. MView: a web-compatible database search or multiple alignment viewer. Bioinformatics, 14(4), pp. 380-381, 1998.
28. DB Search
• Who
  • BioWareDB
• What
  • An extensive and current catalog of software and DBs of relevance to researchers in the field of biology and medicine
• Where
  • Current and available biomedical computing resources
M.W. Matthiessen. BioWareDB: the biomedical software and database search engine. Bioinformatics, 19(17), pp. 2319-2320, 2003.
29. Web Search
• Why
  • Today, scientists can easily post their research findings on the Web or compare their discoveries with previous work
  • Manually maintaining a wrapper library will not scale to accommodate the growth of genomics data sources on the Web
30. Web Search
• Who
  • ---
• What
  • An automated system able to find, classify, and wrap new sources without constant human intervention
• Where
  • Distributed genomics data sources
D. Rocco and T. Critchlow. Automatic discovery and classification of bioinformatics Web sources. Bioinformatics, 19(15), pp. 1927-1933, 2003.
31. Web Search
• Who
  • GoPubMed
• What
  • An ontology-based literature search applied to Gene Ontology (GO) and PubMed
• Where
  • Scientific literature
R. Delfs, A. Doms, A. Kozlenkov, and M. Schroeder. GoPubMed: ontology-based literature search applied to gene ontology and PubMed. In Proceedings of German Bioinformatics Conference, pp. 169-178, 2004.
32. Information Filtering
• Why
  • In the Web 2.0 scenario, users look for collaborative environments in which they can meet other users with similar preferences and needs
  • Researchers need to search for and/or generate specialized datasets that meet specific requirements
33. Information Filtering
• Who
  • ProDaMa-C: Protein Dataset Management - Collaborative
• What
  • A web application aimed at
    • Generating specialized protein structure datasets
    • Favouring the collaboration among researchers
• Where
  • Protein structures
G. Armano and A. Manconi. A Collaborative Web Application for Supporting Researchers in the Task of Generating Protein Datasets. Advances in Distributed Agent-based Retrieval Tools, V. Pallotta, A. Soro, E. Vargiu (eds.), Springer-Verlag, 2011.
34. Information Filtering
• Who
  • Gene Recommender
• What
  • An algorithm that ranks genes according to how strongly they correlate with a set of query genes
• Where
  • Analysis of gene expression
A.B. Owen, J. Stuart, K. Mach, A.M. Villeneuve, S. Kim. A gene recommender algorithm to identify coexpressed genes. Genome Research, Aug;13(8), pp. 1828-37, 2003.
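Ranking genes by their correlation with a query set can be sketched as follows. This is a deliberate simplification and not the published Gene Recommender algorithm: it scores each candidate by its Pearson correlation with the mean expression profile of the query genes. The function names and the toy expression data are assumptions for illustration.

```python
from math import sqrt
from typing import Dict, List, Sequence, Tuple


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 when either profile is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0
    return cov / (sd_x * sd_y)


def rank_genes(expression: Dict[str, List[float]],
               query_genes: List[str]) -> List[Tuple[str, float]]:
    """Rank non-query genes by correlation with the mean query profile."""
    n_experiments = len(expression[query_genes[0]])
    centroid = [
        sum(expression[g][i] for g in query_genes) / len(query_genes)
        for i in range(n_experiments)
    ]
    scores = {
        gene: pearson(profile, centroid)
        for gene, profile in expression.items()
        if gene not in query_genes
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

A gene whose profile rises and falls with the query genes ends up at the top of the list, and an anti-correlated gene ends up at the bottom.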
35. Text Mining
• Why
  • Web-based tools capable of filtering public DBs are increasingly required
  • Interesting and useful information, relevant to the researcher, could appear in documents (e.g., papers) they have not read and therefore be missed entirely
  • Of paramount importance to DB search methods is a reliable means of distinguishing true hits from false hits
  • Biologists construct a pathway by reading a large number of articles and interpreting them as a consistent network, but the link to the original article is lost
36. Text Mining
• Who
  • MedMiner
• What
  • An Internet text mining tool that filters the literature and presents the most relevant portions in a well-organized way that facilitates understanding
• Where
  • Gene expression profiling
L. Tanabe, U. Scherf, L.H. Smith, J.K. Lee, L. Hunter, and J.N. Weinstein. MedMiner: an Internet Text-Mining Tool for Biomedical Information, with Application to Gene Expression Profiling. Biotechniques, Dec;27(6), pp. 1210-4, 1999.
37. Text Mining
• Who
  • BioRAT
• What
  • A research assistant that, given a query,
    • autonomously finds a set of papers
    • reads them
    • highlights the most relevant facts in each
• Where
  • Scientific literature
D. P. A. Corney, B. F. Buxton, W. B. Langdon, and D. T. Jones. BioRAT: Extracting biological information from full-length papers. Bioinformatics, 20(17), pp. 3206-3213, 2004.
38. Text Mining
• Who
  • SAWTED - Structure Assignment With Text Description
• What
  • An automated system for filtering DB hits
• Where
  • Homologues annotation
R.M. MacCallum, L.A. Kelley, and M.J. Sternberg. SAWTED: structure assignment with text description-enhanced detection of remote homologues with automated SWISS-PROT annotation comparisons. Bioinformatics, Feb;16(2), pp. 125-9, 2000.
39. Text Mining
• Who
  • PathText
• What
  • A system to integrate a pathway visualizer, text mining systems and annotation tools into a seamless environment
• Where
  • Pathway visualizations
B. Kemper, T. Matsuzaki, Y. Matsuoka, Y. Tsuruoka, H. Kitano, S. Ananiadou, and J. Tsujii. PathText: a text mining integrator for biological pathway visualizations. Bioinformatics, 26(12), pp. i374-i381, 2010.
40. Text Categorization
• Why
  • Information in text form, such as MEDLINE records, is a greatly underutilized source of biological information
  • Individual researchers find it difficult to keep up with all the new, relevant information
  • Systems that extract structured information from natural language passages have been highly successful in specialized domains
  • The time is ripe for developing such applications for molecular biology and genomics
41. Text Categorization
• Who
  • --
• What
  • Constructing biological knowledge bases by extracting information from text sources
• Where
  • MEDLINE
M. Craven and J. Kumlien. Constructing Biological Knowledge Bases by Extracting Information from Text Sources. In Proceedings of the 7th International Conference on Intelligent Systems for Molecular Biology, 1999.
42. Text Categorization
• Who
  • Genies
• What
  • A natural-language processing system for the extraction of molecular pathways
• Where
  • Scientific publications
C. Friedman, P. Kra, H. Yu, M. Krauthammer, and A. Rzhetsky. Genies: a natural-language processing system for the extraction of molecular pathways from journal articles. Bioinformatics, 17, pp. 574-582, 2001.
43. Hierarchical Text Categorization
• Why
  • A great deal of genomics information accumulated over the years is available in online text repositories (such as MEDLINE)
  • These resources still do not provide adequate mechanisms for retrieving the required information
  • Traditional filtering techniques based on keyword search are often inadequate to express what the user is really searching for
  • Web repositories, such as Medical Subject Headings (MeSH) in MEDLINE, encompass an underlying taxonomy
44. Hierarchical Text Categorization
• Who
  • --
• What
  • A tool for assisting biologists with literature search for the task of associating genes with Gene Ontology codes
• Where
  • MEDLINE
S. Kiritchenko, S. Matwin, and A. F. Famili. Hierarchical text categorization as a tool of associating genes with gene ontology codes. In 2nd European Workshop on Data Mining and Text Mining for Bioinformatics, pp. 26-30, 2004.
45. Hierarchical Text Categorization
• Who
  • Pub.MAS
• What
  • A multiagent system for retrieving and classifying publications
• Where
  • BMC Bioinformatics
  • PubMed Central
G. Armano, A. Manconi, and E. Vargiu. A MultiAgent System for Retrieving Bioinformatics Publications from Web Sources. IEEE Transactions on Nanobioscience, Special Session on GRID, Web Services, Software Agents and Ontology Applications for Life Science, 6(2), pp. 104-109, 2007.
46. Case Study: Retrieving and Filtering Bioinformatics Publications
47. An IR Task
[Diagram: Online Repositories -> Information Extraction (Wrapping Information Sources) -> Extracted Data/Information -> Text Categorization (Taxonomic Classification of Items) -> Selected Data/Information -> User's Feedback (Adaptive Behavior)]
48. Information Extraction
• Essential to retrieve documents provided by heterogeneous and distributed sources
A.H.F. Laender, B.A. Ribeiro-Neto, A.S. da Silva, J.S. Teixeira. A brief survey of web data extraction tools. SIGMOD Rec. 31(2), pp. 84-93, 2002.
49. Text Categorization
• It is the task of determining and assigning topical labels to content
• Typical approaches to text categorization
  • Statistical
  • Semantic
• In recent years, several researchers have investigated the use of hierarchies for text categorization
F. Sebastiani. A tutorial on automated text categorisation. Proceedings of ASAI-99, 1st Argentinian Symposium on Artificial Intelligence, pp. 7-35, 1999.
50. Users' Feedback
• It is aimed at dealing with any feedback provided by the user
• In semiautomated classification and adaptive filtering we may expect the user of a classifier to provide feedback on how test documents have been classified
• In this case further training may be performed during the operating phase
51. Hierarchical Text Categorization
Hierarchical Text Categorization (HTC) deals with problems where categories are organized in the form of a hierarchy.
D. Koller, M. Sahami. Hierarchically classifying documents using very few words. Proceedings of 14th International Conference on Machine Learning, pp. 170-178, 1997.
52. HTC at a Glance
• HTC studies how to improve the performance of classical text categorization techniques by exploiting the knowledge of the taxonomic relationships among classes
53. Motivations
• People organize large collections of documents in hierarchies of topics, or arrange a large body of knowledge in ontologies
• The main goal of automatic text categorization is to deal with underlying taxonomies
• A hierarchical approach can give benefits in real-world scenarios, characterized by information overload and imbalanced data
54. HTC Approaches
• Pachinko machine
  • At each level of the hierarchy
    • The classifier selects the single most probable category
    • It goes down the hierarchy inspecting only the children of the selected node
• Probabilistic hierarchical local approach
  • At each level of the hierarchy
    • The classifier makes probabilistic decisions
    • It selects the leaf categories on the most probable paths
S. Kiritchenko. Hierarchical text categorization and its application to bioinformatics. Ph.D. Thesis, University of Ottawa, Canada, 2006.
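The pachinko-machine strategy can be sketched as a greedy top-down descent. The taxonomy, the keyword sets, and the `keyword_score` function below are hypothetical stand-ins for trained per-category classifiers; only the descent logic reflects the approach described above.

```python
from typing import Callable, Dict, List, Set

# Hypothetical taxonomy: each node maps to its list of children.
TAXONOMY: Dict[str, List[str]] = {
    "root": ["sequence", "structure"],
    "sequence": ["alignment", "annotation"],
    "structure": ["prediction"],
}

# Hypothetical keyword sets standing in for trained classifiers.
KEYWORDS: Dict[str, Set[str]] = {
    "sequence": {"sequence", "alignment", "genome"},
    "structure": {"protein", "structure", "fold"},
    "alignment": {"alignment", "pairwise"},
    "annotation": {"annotation", "gene"},
    "prediction": {"prediction", "secondary"},
}


def keyword_score(document: str, category: str) -> float:
    """Toy relevance score: number of category keywords found in the document."""
    return float(len(set(document.lower().split()) & KEYWORDS[category]))


def pachinko_classify(document: str,
                      taxonomy: Dict[str, List[str]],
                      score: Callable[[str, str], float],
                      root: str = "root") -> List[str]:
    """Greedy descent: at each level keep only the best-scoring child."""
    path = []
    node = root
    while taxonomy.get(node):
        # Only the children of the selected node are inspected.
        node = max(taxonomy[node], key=lambda child: score(document, child))
        path.append(node)
    return path
```

Because each decision is final, an early mistake near the root cannot be recovered further down, which is the usual trade-off of this strategy compared with the probabilistic local approach.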
55. HTC Approaches
• Local classifier per node
  • Each classifier decides whether to forward the document to its children
• Local classifier per parent node
  • Each classifier decides to which subtree(s) the document should be sent
• Local classifier per level
  • The number of outputs per level grows while going down through the taxonomy
• Global classifier
  • One classifier is trained, able to discriminate among all categories
C.J. Silla and A. Freitas. A survey on hierarchical classification across different application domains. Journal of Data Mining and Knowledge Discovery, 2(1-2), pp. 31-72, 2010.
56. Progressive Filtering
• Progressive Filtering (PF) is a simple categorization technique that operates on hierarchically structured categories
• A way to implement PF consists of decomposing a given rooted taxonomy into pipelines, one for each path that exists between the root and each node of the taxonomy
• Each node is a binary classifier able to recognize whether or not an input belongs to the corresponding class
• A threshold selection algorithm (TSA) can be run to identify an optimal, or sub-optimal, combination of thresholds for each pipeline
A. Addis, G. Armano, E. Vargiu. Assessing Progressive Filtering to Perform Hierarchical Text Categorization in Presence of Input Imbalance. Proceedings of International Conference on Knowledge Discovery and Information Retrieval, pp. 14-23, 2010.
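The decomposition into pipelines can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the root is a catch-all node without a classifier of its own, and the `score` callable and `thresholds` mapping are hypothetical placeholders for the trained binary classifiers.

```python
from typing import Callable, Dict, List


def pipelines(taxonomy: Dict[str, List[str]], root: str) -> List[List[str]]:
    """One pipeline per non-root node: the path from the root down to that node."""
    result: List[List[str]] = []

    def walk(node: str, prefix: List[str]) -> None:
        for child in taxonomy.get(node, []):
            path = prefix + [child]
            result.append(path)
            walk(child, path)

    walk(root, [])
    return result


def progressive_filter(document: str,
                       pipeline: List[str],
                       score: Callable[[str, str], float],
                       thresholds: Dict[str, float]) -> bool:
    """Accept only if every binary classifier along the pipeline accepts."""
    return all(score(document, category) >= thresholds[category]
               for category in pipeline)
```

For a taxonomy with children A and B under the root, and A1 and A2 under A, this yields the four pipelines [A], [A, A1], [A, A2] and [B]. A document is assigned to a category only when it passes every classifier on the way down.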
57. PF at a Glance
• Starting from the root, each input traverses the taxonomy as a "token"
58. Classifiers in PF
• Partitioning the taxonomy in pipelines gives rise to a set of new classifiers, each represented by a pipeline
60. Classifiers in PF
• The same classifier may have different behaviours, depending on the pipeline in which it is embedded
• Each pipeline can be considered in isolation from the others
61. Threshold Selection in PF
• A relevant problem is how to calibrate the thresholds of the binary classifiers embedded in each pipeline in order to optimize the pipeline behaviour
• Searching for an optimal or sub-optimal combination of thresholds in a pipeline can be viewed as the problem of finding a maximum of a utility function F that depends on the corresponding threshold vector θ
62. TSA
• For each pipeline the best combination of thresholds is calculated according to a bottom-up algorithm that uses two functions
  • Repair, which increases/decreases (↑ / ↓) the threshold until the utility function reaches a maximum
  • Calibrate, which recursively operates downward from the given classifier by repeatedly calling repair (↑ / ↓)
A. Addis, G. Armano, E. Vargiu. A comparative experimental assessment of a threshold selection algorithm in hierarchical text categorization. In: Advances in Information Retrieval. The 33rd European Conference on Information Retrieval (ECIR 2011), 2011.
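The search for a threshold vector θ that maximizes a utility function F(θ) can be illustrated with a plain greedy coordinate search. This sketch is not the published TSA: it only borrows the name `repair` for a step that moves one threshold up or down while the utility improves, and it omits the recursive `calibrate` step. The `utility` callable, the step size, and the [0, 1] threshold range are assumptions.

```python
from typing import Callable, List


def repair(thresholds: List[float], k: int,
           utility: Callable[[List[float]], float],
           step: float, direction: int) -> List[float]:
    """Move thresholds[k] in one direction for as long as the utility improves."""
    best = utility(thresholds)
    while True:
        candidate = list(thresholds)
        candidate[k] = candidate[k] + direction * step
        if not 0.0 <= candidate[k] <= 1.0:
            break  # stay inside the assumed threshold range
        value = utility(candidate)
        if value <= best:
            break  # no further improvement in this direction
        thresholds, best = candidate, value
    return thresholds


def select_thresholds(n_classifiers: int,
                      utility: Callable[[List[float]], float],
                      step: float = 0.1) -> List[float]:
    """Bottom-up pass: tune the deepest classifier first, then move up the pipeline."""
    thresholds = [0.0] * n_classifiers
    for k in reversed(range(n_classifiers)):
        thresholds = repair(thresholds, k, utility, step, +1)
        thresholds = repair(thresholds, k, utility, step, -1)
    return thresholds
```

In practice the utility would be a measure such as precision, recall, or F1 computed on a validation set for the pipeline. A greedy search of this kind can stop at a local maximum, which is consistent with the slides' wording of an optimal or sub-optimal combination.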
64. The Prototype
• MultiAgent Architecture
  • X.MAS
• Agent Framework
  • JADE
A. Addis, G. Armano, E. Vargiu. From a Generic Multiagent Architecture to Multiagent Information Retrieval Systems. In: AT2AI-6, Sixth International Workshop, From Agent Theory to Agent Implementation, pp. 3-9, 2008.
F. Bellifemine, G. Caire, D. Greenwood. Developing Multi-Agent Systems with JADE (Wiley Series in Agent Technology). John Wiley and Sons, 2007.
65. X.MAS at a Glance
• Macro-architecture
66. X.MAS at a Glance
• Micro-architecture
  • Information Agent: Scheduler, Source
  • Middle Agent: Scheduler, Dispatcher
  • Filter Agent: Scheduler, Actuator
  • Middle Agent: Scheduler, Dispatcher
  • Task Agent: Scheduler, Actuator
  • Middle Agent: Scheduler, Dispatcher
  • Interface Agent: Scheduler
68. Pub.MAS
G. Armano, A. Manconi, and E. Vargiu. A MultiAgent System for
Retrieving Bioinformatics Publications from Web Sources. IEEE
Transactions on Nanobioscience, Special Session on GRID, Web
Services, Software Agents and Ontology Applications for Life Science,
6(2), pp. 104-109, 2007.
69. Information Extraction
• It is supported by a set of agents explicitly devoted to
  • wrapping the selected information sources
  • encoding the extracted documents
• An information agent wraps the BMC Bioinformatics web site
  • HTML wrapper
• An information agent wraps the PubMed Central digital archive
  • Web service wrapper
70. Hierarchical Text Categorization
• The PF approach previously described has been implemented
• Documents have been encoded to
  • remove all non-informative words
  • remove the most common morphological and inflexional suffixes
  • select the relevant features
  • generate a feature vector for each document
• Classification is performed by wkNN classifiers
  • the score is assigned using non-parametric density estimation of the "a posteriori" probability
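The encoding and scoring steps can be sketched end to end. Everything here is an assumption made for illustration: the stop-word list, the crude suffix stripping (a stand-in for a real stemmer), the term-frequency vectors, and the similarity-weighted vote used as a simple non-parametric estimate of the posterior probability. The feature selection step is omitted, and Pub.MAS may differ in every one of these choices.

```python
import math
import re
from collections import Counter
from typing import List, Tuple

STOPWORDS = {"the", "of", "a", "an", "and", "in", "to", "for", "is", "are"}
SUFFIXES = ("ing", "ed", "es", "s")  # crude stand-in for a real stemmer


def encode(text: str) -> Counter:
    """Drop stop words, strip common suffixes, return a term-frequency vector."""
    stems = []
    for token in re.findall(r"[a-z]+", text.lower()):
        if token in STOPWORDS:
            continue
        for suffix in SUFFIXES:
            if token.endswith(suffix) and len(token) - len(suffix) >= 3:
                token = token[: -len(suffix)]
                break
        stems.append(token)
    return Counter(stems)


def cosine(u: Counter, v: Counter) -> float:
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def wknn_score(document: str, training: List[Tuple[str, bool]], k: int = 3) -> float:
    """Similarity-weighted share of positive examples among the k nearest neighbours."""
    query = encode(document)
    neighbours = sorted(
        ((cosine(query, encode(text)), label) for text, label in training),
        key=lambda pair: pair[0],
        reverse=True,
    )[:k]
    total = sum(sim for sim, _ in neighbours)
    if total == 0:
        return 0.0
    return sum(sim for sim, label in neighbours if label) / total
```

The returned score lies in [0, 1] and can be compared against the per-classifier threshold of a pipeline.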
71. The Adopted Taxonomy
P. G. Baker, C. A. Goble, S. Bechhofer, N. W. Paton, R. Stevens, and A. Brass. An ontology for bioinformatics applications. Bioinformatics, 15(6), pp. 510-520, 1999.
74. Users' Feedback
• User feedback is aimed at dealing with any feedback provided by the user
• Two solutions have been tried
  • training an ANN
  • using a kNN classifier
75. Experiments
• Different kinds of tests have been performed, each aimed at highlighting a specific issue
  • we estimated the (normalized) confusion matrix for each classifier belonging to the highest level of the taxonomy
  • we studied the impact of taking into account pipelines of classifiers, also trying to assess whether a residual independence was in fact present
  • we assessed the solution devised for implementing user's feedback, based on the kNN technique
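A normalized confusion matrix can be computed as in the sketch below. The slides do not say how the matrix was normalized; this example assumes row normalization, so each row gives the fraction of items of a true class assigned to each predicted class. The function name and labels are hypothetical.

```python
from typing import Dict, List, Sequence


def normalized_confusion(true_labels: Sequence[str],
                         predicted_labels: Sequence[str],
                         classes: List[str]) -> Dict[str, Dict[str, float]]:
    """Row-normalized confusion matrix: matrix[true][predicted] is a fraction."""
    counts = {t: {p: 0 for p in classes} for t in classes}
    for t, p in zip(true_labels, predicted_labels):
        counts[t][p] += 1
    matrix = {}
    for t in classes:
        total = sum(counts[t].values())
        # A class with no examples gets an all-zero row (assumption).
        matrix[t] = {p: (counts[t][p] / total if total else 0.0) for p in classes}
    return matrix
```

The diagonal entries are then the per-class recall, which makes classifiers trained on differently sized test sets directly comparable.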
76. Experiments
• Tests have been performed using selected publications extracted from the BMC Bioinformatics site and from the PubMed Central digital archive
• Publications have been classified by a domain expert according to the proposed taxonomy
• For each item of the taxonomy, a set of about 100-150 articles has been selected to train the corresponding wkNN classifier, and 300-400 articles have been used to test it
78. Conclusions
• Bioinformatics needs suitable, automated, and "intelligent" solutions to acquire, analyse, organize, and store biological data
• IR might be very useful for tackling bioinformatics problems
• Currently, few IR techniques have been adopted to solve some bioinformatics tasks
• A system aimed at retrieving and filtering bioinformatics publications has been presented as a case study
• We argue that further investigations and experiments could be made to exploit IR in bioinformatics
79. Acknowledgments
• This work was partially supported by the Italian Ministry of Education - Investment funds for basic research, under the project ITALBIONET - Italian Network of Bioinformatics
• I wish to thank all the IASC Group members for their valuable help
• IASC Group members are:
  • G. Armano - head
  • A. Addis, F. Mascia and E. Vargiu - PhD, Post Doc
  • A. Giuliani, N. Hatami, M. Javarone and F. Ledda - PhD students
  • S. Curatti - collaborator, programmer
• I wish to thank also Andrea Manconi for his suggestions
80. Thanks for your attention!
Contact: Eloisa Vargiu vargiu@diee.unica.it