Bioinformatics Meets Information Retrieval: State of the Art and a Case Study

Eloisa Vargiu

Intelligent Agents and Soft-Computing Group
Dept. of Electrical and Electronic Engineering
University of Cagliari, Italy
February 16, 2011 – Valencia (Spain)
email: vargiu@diee.unica.it

My Background

 2000 – 2004  2004 – 2009
 Automatic planning  Bioinformatics
 Classic domains: HW[]  Protein secondary structure
 Dynamic domains: HIPE prediction: MASSP3 and
GAME/SSP
 2000 - …
 2006 - …
 Multiage s te
nt ys ms
 Information Retrieval
 A Personalized Adaptive and
Cooperative Multiagent  Hierarchical text
System: PACMAS categorization: PF and TSA
 A generic architecture to  Recommender systems and
perform information retrieval contextual advertising: ConCA
tasks: X.MAS


Outline

 Context and Mission
 Why Bioinformatics Needs Information Retrieval
 Bioinformatics Meets Information Retrieval
 Case Study: Retrieving and Filtering Bioinformatics Publications
 Conclusions


Context and Mission


Web Evolution

 Web 1.0 1993

 Source of information
 Personal homepages
 Web 2.0 2004
 Social networks
 (Micro)Blogging
 Web 3.0 2005

 Semantic web
 Web composition


Web Evolution and Bioinformatics

 A long time ago...
 Data was stored in local DBs
 Data was shared as flat files
 Biologists worked alone or in small groups


Web Evolution and Bioinformatics

 Today...
 Online repositories
 The major sources of nucleotide sequence data are those belonging to the
International Nucleotide Sequence Database Collaboration
 DDBJ (DNA DataBank of Japan)

 EMBL (European Molecular Biology Laboratory)

 GenBank (NIH genetic sequence database)

 Web services
 Basic bioinformatics services are
classified by the EBI into three categories
 SSS (Sequence Search Services)

 MSA (Multiple Sequence Alignment)

 BSA (Biological Sequence Analysis)


Web Evolution and Scientific Publications
 A long time ago...
 Publications were consulted at the library
 Just two or three relevant available journals
 Manual selection of relevant publications


Web Evolution and Scientific Publications
 Today...
 Online journals
 Online conference proceedings
 Publications are often available for free
 Manual selection of relevant publications
becomes unfeasible


As a Consequence...

 Unstructured information
 Information overload
 Personalized information selection and input imbalance


Our Mission

 To cope with
 Unstructured information, classifying documents according to a
given taxonomy
 Information overload, filtering information to reduce redundancy
 Personalized information selection and input imbalance, filtering
information according to user preferences
 Case study
 Retrieving and filtering bioinformatics publications


Research Topics

 Information Retrieval
 Bioinformatics



Information Retrieval (IR) deals with the representation,
storage, organization of, and access to information items.

The user must first translate this information need into a query
which can be processed by an IR system.

Given the user query, the key goal of an IR system is to retrieve
information which might be useful or relevant to the user.

R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval.
New York: Addison-Wesley, 1999.


Main IR Topics

 Indexing
 Search and Web Search
 Information Filtering
 Text Mining
 Text Categorization and Hierarchical Text Categorization


Bioinformatics

Bioinformatics is the field of science in which biology,
computer science, and information technology merge to form a
single discipline.

The ultimate goal of the field is to enable the discovery of new
biological insights as well as to create a global perspective from
which unifying principles in biology can be discerned.

National Center for Biotechnology Information (NCBI),
http://www.ncbi.nlm.nih.gov/.


Main Bioinformatics Research Areas

 Sequence analysis
 Genome annotation
 Computational evolutionary biology
 Analysis of gene expression
 Analysis of protein expression
 Analysis of mutations in cancer
 Comparative genomics
 Modelling biological systems
 Prediction of protein structure
 Molecular interaction


Why Bioinformatics Needs
Information Retrieval


Does Bioinformatics Need IR?

 Bioinformatics is concerned with researching, developing and
applying tools and methods to acquire, analyse, organize and
store biological and medical data

 Indexing and search techniques may help in the task of acquiring data
 Information filtering, text mining and text categorization
techniques may be useful for the analysis of data
 Text categorization, with particular reference to hierarchical text
categorization, may be used in the organization and storage tasks


Bioinformatics Data

 A huge amount of data to be
 Indexed
 Searched for in large databases or on the web
 Filtered according to users' preferences
 Text mined
 Categorized according to its textual content


DB Indexing

 Why
 Data types are relegated to blob and unstructured text fields
 Few results in building persistent access paths to support fast
retrieval methods
 Genomic datasets in public repositories are annotated with free-text
fields describing the pathological state of the studied sample
 Annotations are not mapped to concepts in any ontology


DB Indexing

 Who
 MoBIoS – Molecular Biological Information System
 What
 A specialized database management system
 The storage manager is based on metric-space indexing
 Query language entails biological data types
 Where
 Sequence homology: local alignment and mutations

D. Miranker, W. Xu, and R. Mao. MoBIoS: a Metric-Space DBMS to
Support Biological Discovery. Proceedings of the International
Conference on Scientific and Statistical Database Management
Systems, 2003.

DB Indexing

 Who
 --
 What
 Ontology-driven indexing of public datasets for translational
bioinformatics
 Methods to map text annotations of gene expression datasets to
concepts in the UMLS
 Where
 Gene Expression Omnibus
 Stanford Tissue Microarray Database

N.H. Shah, C. Jonquet, A.P. Chiang, A.J. Butte, R. Chen, and M.A.
Musen. Ontology-driven indexing of public datasets for translational
bioinformatics. BMC Bioinformatics, 10(Suppl 2):S1, 2009.

Web Indexing

 Why
 Most often sequence retrieval tools and sequence analysis tools are
separated
 The usage of sequence DBs is often general and limited to
keyword searching and entry retrieval
 Discovering and accessing the appropriate bioinformatics resource
for a specific task has become increasingly important


Web Indexing

 Who
 SIRW – A Web Server for Simple Indexing and Retrieval System
 What
 A WWW interface to the Simple Indexing and Retrieval (SIR)
system to parse and index flat file DBs
 A framework for doing sequence analysis for selected biological
sequences
 Where
 Sequence analysis: motif pattern searches

C. Ramu. SIRW: a web server for the Simple Indexing and Retrieval
System that combines sequence motif searches with keyword searches.
Nucleic Acids Research, 31(13), pp. 3771-3774, 2003.


Web Indexing

 Who
 BIRI - BIoinformatics Resource Inventory
 What
 An approach for automatically discovering and indexing public
bioinformatics resources
 Where
 The scientific literature

G. de la Calle, M. García-Remesal, S. Chiesa, D. de la Iglesia, and V.
Maojo. BIRI: a new approach for automatically discovering and
indexing available public bioinformatics resources from the literature.
BMC Bioinformatics, 10:320, 2009.

DB Search

 Why
 A wealth of bioinformatics tools and databases has been created
over the last decade and most are freely available
 Often it is desired to visualize the database hits stacked according
to the query sequence
 There is no inventory presenting an up-to-date and easily
searchable index of all these resources


DB Search

 Who
 MView – Multiple alignment Viewer
 What
 A tool for converting the result of a sequence database search into
the form of a coloured multiple alignment of hits stacked against
the query
 Where
 Multiple alignment

N.P. Brown, C. Leroy, and C. Sander. MView: a web-compatible
database search or multiple alignment viewer. Bioinformatics, 14(4), pp.
380-381, 1998.


DB Search

 Who
 BioWareDB
 What
 An extensive and current catalog of software and DBs of relevance
to researchers in the field of biology and medicine
 Where
 Current and available biomedical computing resources

M.W. Matthiessen. BioWareDB: the biomedical software and database
search engine. Bioinformatics, 19(17), pp. 2319-2320, 2003.


Web Search

 Why
 Today, scientists can easily post their research findings on the Web
or compare their discoveries with previous work
 Manually maintaining a wrapper library will not scale to
accommodate the growth of genomics data sources on the Web


Web Search

 Who
 ---
 What
 An automated system able to find, classify, and wrap new sources
without constant human intervention
 Where
 Distributed genomics data sources

D. Rocco and T. Critchlow. Automatic discovery and classification of
bioinformatics Web sources. Bioinformatics, 19(15), pp. 1927-1933,
2003.


Web Search

 Who
 GoPubMed
 What
 An ontology-based literature search applied to Gene Ontology
(GO) and PubMed
 Where
 Scientific literature

R. Delfs, A. Doms, A. Kozlenkov, and M. Schroeder. GoPubMed:
ontology-based literature search applied to gene ontology and PubMed.
In Proceedings of German Bioinformatics Conference, pp. 169–178,
2004.

Information Filtering

 Why
 In the Web 2.0 scenario, users look for collaborative environments,
in which they can meet further users with similar preferences and
needs
 Researchers need to search for and/or generate specialized datasets
that meet specific requirements



 Who
 ProDaMa-C – Protein Dataset Management, Collaborative
 What
 A web application aimed at
 Generating specialized protein structure datasets
 Favouring the collaboration among researchers
 Where
 Protein structures

G. Armano and A. Manconi. A Collaborative Web Application for
Supporting Researchers in the Task of Generating Protein Datasets.
Advances in Distributed Agent-based Retrieval Tools, V. Pallotta, A.
Soro, E. Vargiu (eds.), Springer-Verlag, 2011.


 Who
 Gene Recommender
 What
 An algorithm that ranks genes according to how strongly they
correlate with a set of query genes
 Where
 Analysis of gene expression

A.B. Owen, J. Stuart, K. Mach, A.M. Villeneuve, and S. Kim. A gene
recommender algorithm to identify coexpressed genes. Genome
Research, 13(8), pp. 1828-1837, 2003.


Text Mining

 Why
 Web-based tools capable of filtering public DBs are more and more
required
 Interesting and useful information, relevant to the researcher, could
appear in documents (e.g., papers) they have not read and therefore
be missed entirely
 Of paramount importance to DB search methods is a reliable
means of distinguishing true hits from false hits
 Biologists construct a pathway by reading a large number of
articles and interpreting them as a consistent network, but the link to
the original article is lost


Text Mining

 Who
 MedMiner
 What
 An Internet text mining tool that filters the literature and presents
the most relevant portions in a well-organized way that facilitates
understanding
 Where
 Gene expression profiling

L. Tanabe, U. Scherf, L.H. Smith, J.K. Lee, L. Hunter, and J.N.
Weinstein. MedMiner: an Internet Text-Mining Tool for Biomedical
Information, with Application to Gene Expression Profiling.
Biotechniques, 27(6), pp. 1210-1214, 1999.

Text Mining

 Who
 BioRAT
 What
 A research assistant that, given a query,
 autonomously finds a set of papers
 reads them
 highlights the most relevant facts in each
 Where
 Scientific literature

D. P. A. Corney, B. F. Buxton, W. B. Langdon, and D. T. Jones.
BioRAT: Extracting biological information from full-length papers.
Bioinformatics, 20(17), pp. 3206–3213, 2004.


Text Mining

 Who
 SAWTED – Structure Assignment With Text Description
 What
 An automated system for filtering DB hits
 Where
 Homologues annotation

R.M. MacCallum, L.A. Kelley, and M.J. Sternberg. SAWTED: structure
assignment with text description-enhanced detection of remote
homologues with automated SWISS-PROT annotation comparisons.
Bioinformatics, 16(2), pp. 125-129, 2000.

Text Mining

 Who
 PathText
 What
 A system to integrate a pathway visualizer, text mining systems
and annotation tools into a seamless environment
 Where
 Pathway visualizations

B. Kemper, T. Matsuzaki, Y. Matsuoka, Y. Tsuruoka, H. Kitano, S.
Ananiadou, and J. Tsujii. PathText: a text mining integrator for
biological pathway visualizations. Bioinformatics, 26(12), pp. i374-i381,
2010.

Text Categorization

 Why
 Information in text form, such as MEDLINE records, is a greatly
underutilized source of biological information
 Individual researchers find it difficult to keep up with all the new,
relevant information
 Systems that extract structured information from natural language
passages have been highly successful in specialized domains
 Time is ripe for developing such applications for molecular biology
and genomics


Text Categorization

 Who
 --
 What
 Constructing biological knowledge bases by extracting information
from text sources
 Where
 MEDLINE

M. Craven and J. Kumlien. Constructing Biological Knowledge Bases
by Extracting Information from Text Sources. In Proceedings of the 7th
International Conference on Intelligent Systems for Molecular Biology,
1999.

Text Categorization

 Who
 Genies
 What
 A natural-language processing system for the extraction of
molecular pathways
 Where
 Scientific publications

C. Friedman, P. Kra, H. Yu, M. Krauthammer, and A. Rzhetsky. Genies:
a natural-language processing system for the extraction of molecular
pathways from journal articles. Bioinformatics, 17, pp. 574–582, 2001.


Hierarchical Text Categorization

 Why
 A great deal of genomics information accumulated through years is
available in online text repositories (such as MEDLINE)
 These resources still do not provide adequate mechanisms for
retrieving the required information
 Traditional filtering techniques based on keyword search are often
inadequate to express what the user is really searching for
 Web repositories, such as Medical Subject Headings (MeSH) in
MEDLINE, encompass an underlying taxonomy



 Who
 --
 What
 A tool for assisting biologists with literature search for the task of
associating genes with Gene Ontology codes
 Where
 MEDLINE

S. Kiritchenko, S. Matwin, and A. F. Famili. Hierarchical text
categorization as a tool of associating genes with gene ontology codes.
In 2nd European Workshop on Data Mining and Text Mining for
Bioinformatics, pp. 26–30, 2004.


 Who
 Pub.MAS
 What
 A multiagent system for retrieving and classifying publications
 Where
 BMC Bioinformatics
 PubMed Central

G. Armano, A. Manconi, and E. Vargiu. A MultiAgent System for
Retrieving Bioinformatics Publications from Web Sources. IEEE
Transactions on Nanobioscience, Special Session on GRID, Web
Services, Software Agents and Ontology Applications for Life Science,
6(2), pp. 104-109, 2007.

Case Study:
Retrieving and Filtering
Bioinformatics Publications


An IR Task

[Diagram: Online Repositories → Information Extraction (Wrapping Information Sources) → Extracted Data/Information → Text Categorization (Taxonomic Classification of Items) → Selected Data/Information → User's Feedback (Adaptive Behavior)]



 Essential to retrieve documents provided by heterogeneous and
distributed sources

A.H.F. Laender, B.A. Ribeiro-Neto, A.S. da Silva, and J.S. Teixeira.
A brief survey of web data extraction tools. SIGMOD Record, 31(2), pp.
84–93, 2002.

Text Categorization

 It is the task of determining and assigning topical labels to
content
 Typical approaches to text categorization
 Statistical
 Semantic
 In recent years several researchers have investigated the use of
hierarchies for text categorization

F. Sebastiani. A tutorial on automated text categorisation. Proceedings
of ASAI-99, 1st Argentinian Symposium on Artificial Intelligence,
pp. 7-35, 1999.


Users' Feedback

 It is aimed at dealing with any feedback provided by the user
 In semiautomated classification and adaptive filtering we may
expect the user of a classifier to provide feedback on how test
documents have been classified
 In this case further training may be performed during the
operating phase



Hierarchical Text Categorization (HTC) deals with problems
where categories are organized in the form of a hierarchy.

D. Koller and M. Sahami. Hierarchically classifying documents using very
few words. Proceedings of 14th International Conference on Machine
Learning, pp. 170–178, 1997.


HTC at a Glance

 HTC studies how to improve the performance of
classical text categorization techniques by exploiting the
knowledge of the taxonomic relationships among classes


Motivations

 People organize large collections of documents in hierarchies of
topics, or arrange a large body of knowledge in ontologies
 The main goal of automatic text categorization is to deal with
underlying taxonomies
 A hierarchical approach can give benefits in real-world scenarios,
characterized by information overload and imbalanced data


HTC Approaches

 Pachinko machine
 At each level of the hierarchy
 The classifier selects the one most probable category
 It goes down the hierarchy inspecting only the children of the selected
nodes
 Probabilistic hierarchical local approach
 At each level of the hierarchy
 The classifier makes probabilistic decisions
 It selects the leaf categories on the most probable paths

S. Kiritchenko. Hierarchical text categorization and its application to
bioinformatics. Ph.D. Thesis, University of Ottawa, Canada, 2006.
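
The difference between the two approaches above can be sketched in a few lines of Python. This is only an illustration: the `Node` class, the function names and the probabilities are made up for the example, and each `prob` is assumed to be a local estimate of P(category | parent, document).

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Node:
    """A category in the taxonomy; `prob` is a local, per-level estimate."""
    name: str
    prob: float = 1.0
    children: list[Node] = field(default_factory=list)


def pachinko(root: Node) -> str:
    """Greedy descent: at each level keep only the most probable child."""
    node = root
    while node.children:
        node = max(node.children, key=lambda child: child.prob)
    return node.name


def best_leaf(root: Node) -> tuple[str, float]:
    """Probabilistic descent: score each root-to-leaf path by the product
    of its local probabilities and return the best-scoring leaf."""
    best = ("", 0.0)
    stack = [(root, 1.0)]
    while stack:
        node, path_prob = stack.pop()
        p = path_prob * node.prob
        if not node.children:
            if p > best[1]:
                best = (node.name, p)
        else:
            stack.extend((child, p) for child in node.children)
    return best


# Toy taxonomy with illustrative probabilities.
taxonomy = Node("root", children=[
    Node("A", 0.6, [Node("A1", 0.55), Node("A2", 0.45)]),
    Node("B", 0.4, [Node("B1", 0.9), Node("B2", 0.1)]),
])

print(pachinko(taxonomy))   # greedy choice commits to branch A
print(best_leaf(taxonomy))  # path product favours a leaf under B
```

In this toy case the greedy descent ends in A1, while the path-product view prefers B1, which shows how an early hard decision can rule out the best overall path.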

HTC Approaches

 Local classifier per node
 Each classifier decides whether to forward the document to its children
 Local classifier per parent node
 Each classifier decides to which subtree(s) the document should be
sent
 Local classifier per level
 The number of outputs per level grows while going down through
the taxonomy
 Global classifier
 One classifier is trained, able to discriminate among all categories

C.J. Silla and A. Freitas. A survey of hierarchical classification across
different application domains. Data Mining and Knowledge
Discovery, 22(1-2), pp. 31-72, 2010.

Progressive Filtering

 Progressive Filtering (PF) is a simple categorization technique
that operates on hierarchically structured categories
 A way to implement PF consists of decomposing a given rooted
taxonomy into pipelines, one for each path that exists between
the root and each node of the taxonomy
 Each node is a binary classifier able to recognize whether or not
an input belongs to the corresponding class
 A threshold selection algorithm (TSA) can be run to identify an
optimal, or sub-optimal, combination of thresholds for each
pipeline
A. Addis, G. Armano, E. Vargiu. Assessing Progressive Filtering to
Perform Hierarchical Text Categorization in Presence of Input
Imbalance. Proceedings of International Conference on Knowledge
Discovery and Information Retrieval, pp. 14-23, 2010.
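
A minimal sketch of the pipeline decomposition, assuming the taxonomy is given as a parent-to-children mapping and the root is a dummy node with no classifier of its own. The function names, the toy taxonomy, the scores and the 0.5 thresholds are illustrative assumptions, not taken from the prototype.

```python
from typing import Mapping, Sequence


def pipelines(taxonomy: Mapping[str, Sequence[str]], root: str) -> list[list[str]]:
    """Return one pipeline (path below the root) for every non-root node."""
    found: list[list[str]] = []
    stack = [[child] for child in taxonomy.get(root, [])]
    while stack:
        path = stack.pop()
        found.append(path)
        stack.extend(path + [child] for child in taxonomy.get(path[-1], []))
    return sorted(found)


def accepts(pipeline: Sequence[str], scores: Mapping[str, float],
            thresholds: Sequence[float]) -> bool:
    """An input passes a pipeline only if every classifier along it fires."""
    return all(scores[name] >= t for name, t in zip(pipeline, thresholds))


# Toy taxonomy and binary-classifier scores (illustrative values only).
taxonomy = {"root": ["A", "B"], "A": ["A1", "A2"]}
scores = {"A": 0.8, "A1": 0.3, "A2": 0.7, "B": 0.2}

all_pipelines = pipelines(taxonomy, "root")
accepted = [p for p in all_pipelines if accepts(p, scores, [0.5] * len(p))]
print(all_pipelines)
print(accepted)
```

Thresholds are passed per pipeline rather than per classifier, so the same classifier can be given a different threshold in each pipeline that contains it.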

PF at a Glance

 Starting from the root, each input traverses the taxonomy as a
“token”

Classifiers in PF

 Partitioning the taxonomy in pipelines gives rise to a set of new
classifiers, each represented by a pipeline


Classifiers in PF

 The same classifier may have different behaviours, depending on
the pipeline in which it is embedded
 Each pipeline can be considered in isolation from the others

Threshold Selection in PF

 A relevant problem is how to calibrate the threshold of the
binary classifiers embedded by each pipeline in order to
optimize the pipeline behaviour
 Searching for an optimal or sub-optimal combination of
thresholds in a pipeline can be viewed as the problem of
finding a maximum of a utility function F that depends on the
corresponding threshold vector θ
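
In symbols, the search can be restated as follows. The pipeline length n and the range [0, 1] for each threshold are assumptions made for the sake of notation; the slides do not fix a specific utility function F.

```latex
\theta^{*} = \arg\max_{\theta \in [0,1]^{n}} F(\theta),
\qquad \theta = (\theta_{1}, \dots, \theta_{n})
```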


TSA

 For each pipeline the best combination of thresholds is
calculated according to a bottom up algorithm that uses two
functions
 Repair, which increases or decreases (↑ / ↓) the threshold until the
utility function reaches a maximum
 Calibrate, which recursively operates downward from the given
classifier by repeatedly calling repair (↑ / ↓)

A. Addis, G. Armano, E. Vargiu. A comparative experimental
assessment of a threshold selection algorithm in hierarchical text
categorization. In: Advances in Information Retrieval. The 33rd
European Conference on Information Retrieval (ECIR 2011), 2011
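
The following is a coordinate-wise hill-climbing sketch loosely inspired by the repair/calibrate description above. It is not the published TSA: the step size, the starting point at zero, the order of the repair calls and the toy utility are all assumptions made for illustration. In practice the utility would be a measure such as F1 computed on validation data.

```python
from typing import Callable, Sequence

Utility = Callable[[Sequence[float]], float]


def repair(theta: list[float], k: int, step: float, utility: Utility) -> None:
    """Move theta[k] by `step` (positive = up, negative = down) while the
    utility strictly improves and the threshold stays inside [0, 1]."""
    best = utility(theta)
    while 0.0 <= theta[k] + step <= 1.0:
        theta[k] += step
        value = utility(theta)
        if value <= best:
            theta[k] -= step  # undo the move that did not help
            break
        best = value


def calibrate(theta: list[float], k: int, step: float, utility: Utility) -> None:
    """Repair classifier k upward, then revisit every classifier below it
    (downward, then upward) as a simple stand-in for recursive calibration."""
    repair(theta, k, step, utility)
    for j in range(k + 1, len(theta)):
        repair(theta, j, -step, utility)
        repair(theta, j, step, utility)


def select_thresholds(n: int, step: float, utility: Utility) -> list[float]:
    """Bottom-up pass: start from the last classifier of the pipeline."""
    theta = [0.0] * n
    for k in reversed(range(n)):
        calibrate(theta, k, step, utility)
    return theta


def toy_utility(theta: Sequence[float]) -> float:
    """Hypothetical utility with a single maximum at (0.5, 0.75)."""
    return -((theta[0] - 0.5) ** 2 + (theta[1] - 0.75) ** 2)


thresholds = select_thresholds(2, 0.25, toy_utility)
print(thresholds)
```

Each call to `repair` evaluates the utility once per step, so the cost of this sketch grows with the number of steps and the pipeline length rather than with the size of the full threshold grid.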


TSA: An Example


The Prototype

 MultiAgent Architecture
 X.MAS
 Agent Framework
 JADE

A. Addis, G. Armano, E. Vargiu. From a Generic Multiagent
Architecture to Multiagent Information Retrieval Systems. In: AT2AI-6,
Sixth International Workshop, From Agent Theory to Agent
Implementation, pp. 3–9, 2008.

F. Bellifemine, G. Caire,D. Greenwood. Developing Multi-Agent
Systems with JADE (Wiley Series in Agent Technology). John Wiley
and Sons, 2007.

X.MAS at a Glance

 Macro-architecture


X.MAS at a Glance
 Micro-architecture
[Diagram: agent layers, top to bottom. Information Agent (Scheduler, Source); Middle Agent (Scheduler, Dispatcher); Filter Agent (Scheduler, Actuator); Middle Agent; Task Agent (Scheduler, Actuator); Middle Agent; Interface Agent (Scheduler)]


Pub.MAS


Pub.MAS

G. Armano, A. Manconi, and E. Vargiu. A MultiAgent System for
Retrieving Bioinformatics Publications from Web Sources. IEEE
Transactions on Nanobioscience, Special Session on GRID, Web
Services, Software Agents and Ontology Applications for Life Science,
6(2), pp. 104-109, 2007.


 It is supported by a set of agents explicitly devoted to
 wrap the selected information sources
 encode the extracted documents
 An information agent wraps the BMC Bioinformatics web site
 HTML wrapper
 An information agent wraps the PubMed Central digital archive
 Web service wrapper



 The PF approach previously described has been implemented
 Documents are encoded to
 remove all non-informative words
 remove the most common morphological and inflexional suffixes
 select the relevant features
 generate a feature vector for each document
 Classification is performed by wkNN classifiers
 the score is assigned using non-parametric density estimation of the
“a posteriori” probability
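
The encoding and scoring steps can be illustrated with a small, self-contained sketch. Everything here is a simplification chosen for the example: the stopword list and suffix rules are toy stand-ins for real resources, feature selection is omitted, and the weighting scheme (cosine similarity) is an assumption rather than the one used in the prototype.

```python
import math
import re
from collections import Counter

STOPWORDS = {"the", "of", "a", "and", "in", "to", "for", "is"}  # toy list
SUFFIXES = ("ing", "s")  # crude stand-in for a real stemmer


def encode(text: str) -> Counter:
    """Lowercase, drop stopwords, strip a few suffixes, count terms."""
    terms = []
    for word in re.findall(r"[a-z]+", text.lower()):
        if word in STOPWORDS:
            continue
        for suffix in SUFFIXES:
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                word = word[: -len(suffix)]
                break
        terms.append(word)
    return Counter(terms)


def cosine(u: Counter, v: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def wknn_score(doc: Counter, training: list[tuple[Counter, bool]], k: int = 3) -> float:
    """Similarity-weighted share of positive examples among the k nearest
    neighbours, read as a rough estimate of P(class | doc)."""
    neighbours = sorted(training, key=lambda ex: cosine(doc, ex[0]), reverse=True)[:k]
    weighted = [(cosine(doc, vec), label) for vec, label in neighbours]
    total = sum(w for w, _ in weighted)
    return sum(w for w, label in weighted if label) / total if total else 0.0


# Toy training set for a hypothetical "protein structure" category.
training = [
    (encode("protein structure prediction"), True),
    (encode("secondary structure of proteins"), True),
    (encode("gene expression analysis"), False),
    (encode("microarray gene expression"), False),
]

print(wknn_score(encode("predicting protein structures"), training))
print(wknn_score(encode("gene expression profiling"), training))
```

A score near 1 means the nearest neighbours are mostly positive examples of the category; comparing it against a threshold turns it into the binary decision each node needs.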


The Adopted Taxonomy

P. G. Baker, C. A. Goble, S. Bechhofer, N. W. Paton, R. Stevens, and A.
Brass. An ontology for bioinformatics applications. Bioinformatics,
15(6), pp. 510–520, 1999.

The Adopted Taxonomy


Users' Feedback

 User feedback is aimed at dealing with any feedback provided
by the user
 Two solutions have been tested
 training an ANN
 using a kNN classifier


Experiments

 Different kinds of tests have been performed, each aimed at
highlighting a specific issue
 we estimated the (normalized) confusion matrix for each classifier
belonging to the highest level of the taxonomy
 we studied the impact of taking into account pipelines of
classifiers, also trying to assess whether a residual independence
was in fact present
 we assessed the solution devised for implementing user’s feedback,
based on the k-NN technique
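
For readers unfamiliar with the first kind of test, a normalized confusion matrix can be computed as below. Row normalization (each row of true labels sums to 1) is an assumption, since the slides do not say which normalization was used, and the labels are invented for the example.

```python
def normalized_confusion(actual: list[str], predicted: list[str]) -> dict[str, dict[str, float]]:
    """Row-normalized confusion matrix: result[a][p] is the fraction of
    items whose true label is `a` that were predicted as `p`."""
    labels = sorted(set(actual) | set(predicted))
    counts = {a: {p: 0 for p in labels} for a in labels}
    for a, p in zip(actual, predicted):
        counts[a][p] += 1
    result = {}
    for a, row in counts.items():
        total = sum(row.values())
        result[a] = {p: (n / total if total else 0.0) for p, n in row.items()}
    return result


# Illustrative labels only.
actual = ["seq", "seq", "seq", "seq", "struct", "struct"]
predicted = ["seq", "seq", "seq", "struct", "struct", "struct"]
matrix = normalized_confusion(actual, predicted)
print(matrix)
```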


Experiments

 Tests have been performed using selected publications extracted
from the BMC Bioinformatics site and from the PubMed Central
digital archive
 Publications have been classified by an expert of the domain
according to the proposed taxonomy
 For each item of the taxonomy, a set of about 100-150 articles
has been selected to train the corresponding wk-NN classifier,
and 300-400 articles have been used to test it


Conclusions


Conclusions

 Bioinformatics needs suitable, automated, and “intelligent”
solutions to acquire, analyse, organize, and store biological data
 IR may be very useful in tackling bioinformatics problems
 Currently, few IR techniques have been adopted to solve some
bioinformatics tasks
 A system aimed at retrieving and filtering bioinformatics
publications has been presented as a case study
 We argue that further investigations and experiments could be
made to exploit IR in bioinformatics


Acknowledgments

 This work was partially supported by the Italian Ministry of
Education – Investment funds for basic research, under the
project ITALBIONET – Italian Network of Bioinformatics
 I wish to thank all the IASC Group members for their valuable
help
 IASC Group members are:
 G. Armano – head
 A. Addis, F. Mascia and E. Vargiu – PhD, Post Doc
 A. Giuliani, N. Hatami, M. Javarone and F. Ledda – PhD students
 S. Curatti – collaborator, programmer
 I wish to thank also Andrea Manconi for his suggestions


Thanks for your
attention!
Contact: Eloisa Vargiu vargiu@diee.unica.it

