Bioinformatics Meets
               Information Retrieval
   State of the Art and a Case Study
                                              Eloisa Vargiu



                           Intelligent Agents and Soft-Computing Group
                          Dept. of Electrical and Electronic Engineering
                                       University of Cagliari, Italy
February 16, 2011 – Valencia (Spain)   email: vargiu@diee.unica.it
My Background

 • 2000 – 2004
    • Automatic planning
       • Classic domains: HW[]
       • Dynamic domains: HIPE
 • 2000 – …
    • Multiagent systems
       • A Personalized Adaptive and Cooperative Multiagent System: PACMAS
       • A generic architecture to perform information retrieval tasks: X.MAS
 • 2004 – 2009
    • Bioinformatics
       • Protein secondary structure prediction: MASSP3 and GAME/SSP
 • 2006 – …
    • Information Retrieval
       • Hierarchical text categorization: PF and TSA
       • Recommender systems and contextual advertising: ConCA


Outline

 • Context and Mission
 • Why Bioinformatics Needs Information Retrieval
 • Bioinformatics Meets Information Retrieval
 • Case Study: Retrieving and Filtering Bioinformatics Publications
 • Conclusions




Context and Mission




Web Evolution

 • Web 1.0 (1993)
    • Source of information
    • Personal homepages
 • Web 2.0 (2004)
    • Social networks
    • (Micro)Blogging
 • Web 3.0 (2005)
    • Semantic web
    • Web composition




Web Evolution and Bioinformatics

 • A long time ago...
    • Data was stored in local DBs
    • Data was shared as flat files
    • Biologists worked alone or in small groups




Web Evolution and Bioinformatics

 • Today...
    • Online repositories
       • The major sources of nucleotide sequences are those belonging to the
         International Nucleotide Sequence Database Collaboration
          • DDBJ (DNA DataBank of Japan)
          • EMBL (European Molecular Biology Laboratory)
          • GenBank (NIH genetic sequence database)
    • Web services
       • Basic bioinformatics services are classified by the EBI into three categories
          • SSS (Sequence Search Services)
          • MSA (Multiple Sequence Alignment)
          • BSA (Biological Sequence Analysis)



Web Evolution and Scientific Publications

 • A long time ago...
    • Publications were consulted at the library
    • Just two or three relevant journals were available
    • Manual selection of relevant publications




Web Evolution and Scientific Publications

 • Today...
    • Online journals
    • Online conference proceedings
    • Publications are often available for free
    • Manual selection of relevant publications becomes unfeasible




As a Consequence...

 • Unstructured information
 • Information overload
 • Personalized information selection and input imbalance




Our Mission

 • To cope with
    • Unstructured information, classifying documents according to a
      given taxonomy
    • Information overload, filtering information to reduce redundancy
    • Personalized information selection and input imbalance, filtering
      information according to user preferences
 • Case study
    • Retrieving and filtering bioinformatics publications




Research Topics

 • Information Retrieval
 • Bioinformatics




Information Retrieval

 Information Retrieval (IR) deals with the representation,
 storage, organization of, and access to information items.

 The user must first translate this information need into a query
 which can be processed by an IR system.

 Given the user query, the key goal of an IR system is to retrieve
 information which might be useful or relevant to the user.


                R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval.
                New York: Addison-Wesley, 1999.

Main IR Topics

 • Indexing
 • Search and Web Search
 • Information Filtering
 • Text Mining
 • Text Categorization and Hierarchical Text Categorization




Bioinformatics

 Bioinformatics is the field of science in which biology,
 computer science, and information technology merge to form a
 single discipline.

 The ultimate goal of the field is to enable the discovery of new
 biological insights as well as to create a global perspective from
 which unifying principles in biology can be discerned.


                National Center for Biotechnology Information (NCBI),
                http://www.ncbi.nlm.nih.gov/.

Main Bioinformatics Research Areas

 • Sequence analysis
 • Genome annotation
 • Computational evolutionary biology
 • Analysis of gene expression
 • Analysis of protein expression
 • Analysis of mutations in cancer
 • Comparative genomics
 • Modelling biological systems
 • Prediction of protein structure
 • Molecular interaction

Why Bioinformatics Needs Information Retrieval




Does Bioinformatics Need IR?

 • Bioinformatics is concerned with researching, developing and
   applying tools and methods to acquire, analyse, organize and
   store biological and medical data

 • Indexing and search techniques may help in the task of acquiring data
 • Information filtering, text mining and text categorization
   techniques may be useful for the analysis of data
 • Text categorization, with particular reference to hierarchical text
   categorization, may be used in the organization and storage tasks



Bioinformatics Data

 • A huge amount of data to be
    • Indexed
    • Searched for in large databases or on the web
    • Filtered according to users' preferences
    • Text mined
    • Categorized according to its textual content




DB Indexing

 • Why
    • Data types are relegated to BLOB and unstructured text fields
    • Few results exist on building persistent access paths to support fast
      retrieval methods
    • Genomic datasets in public repositories are annotated with free-text
      fields describing the pathological state of the studied sample
    • Annotations are not mapped to concepts in any ontology




DB Indexing

 • Who
    • MoBIoS – Molecular Biological Information System
 • What
    • A specialized database management system
    • The storage manager is based on metric-space indexing
    • The query language entails biological data types
 • Where
    • Sequence homology: local alignment and mutations


                D. Miranker, W. Xu, and R. Mao. MoBIoS: a Metric-Space DBMS to
                Support Biological Discovery. Proceedings of the International
                Conference on Scientific and Statistical Database Management
                Systems, 2003.

DB Indexing

 • Who
    • --
 • What
    • Ontology-driven indexing of public datasets for translational
      bioinformatics
    • Methods to map text annotations of gene expression datasets to
      concepts in the UMLS
 • Where
    • Gene Expression Omnibus
    • Stanford Tissue Microarray Database

                N.H. Shah, C. Jonquet, A.P. Chiang, A.J. Butte, R. Chen, and M.A.
                Musen. Ontology-driven indexing of public datasets for translational
                bioinformatics. BMC Bioinformatics, 10(Suppl 2):S1, 2009.

Web Indexing

 • Why
    • Most often sequence retrieval tools and sequence analysis tools are
      separated
    • The usage of sequence DBs is often general and limited to
      keyword searching and entry retrieval
    • Discovering and accessing the appropriate bioinformatics resource
      for a specific task has become increasingly important




Web Indexing

 • Who
    • SIRW – a Web server for the Simple Indexing and Retrieval System
 • What
    • A WWW interface to the Simple Indexing and Retrieval (SIR)
      system to parse and index flat file DBs
    • A framework for performing sequence analysis on selected biological
      sequences
 • Where
    • Sequence analysis: motif pattern searches

                C. Ramu. SIRW: a web server for the Simple Indexing and Retrieval
                System that combines sequence motif searches with keyword searches.
                Nucleic Acids Research, 31(13), pp. 3771-3774, 2003.

Web Indexing

 • Who
    • BIRI – BIoinformatics Resource Inventory
 • What
    • An approach for automatically discovering and indexing public
      bioinformatics resources
 • Where
    • The scientific literature


                G. de la Calle, M. García-Remesal, S. Chiesa, D. de la Iglesia, and V.
                Maojo. BIRI: a new approach for automatically discovering and
                indexing available public bioinformatics resources from the literature.
                BMC Bioinformatics, 10:320, 2009.

DB Search

 • Why
    • A wealth of bioinformatics tools and databases has been created
      over the last decade and most are freely available
    • Often it is desired to visualize the database hits stacked according
      to the query sequence
    • There is no inventory presenting an up-to-date and easily
      searchable index of all these resources




DB Search

 • Who
    • MView – Multiple alignment Viewer
 • What
    • A tool for converting the result of a sequence database search into
      the form of a coloured multiple alignment of hits stacked against
      the query
 • Where
    • Multiple alignment


                N.P. Brown, C. Leroy, and C. Sander. MView: a web-compatible
                database search or multiple alignment viewer. Bioinformatics, 14(4), pp.
                380-381, 1998.

DB Search

 • Who
    • BioWareDB
 • What
    • An extensive and current catalog of software and DBs of relevance
      to researchers in the field of biology and medicine
 • Where
    • Current and available biomedical computing resources


                M.W. Matthiessen. BioWareDB: the biomedical software and database
                search engine. Bioinformatics, 19(17), pp. 2319-2320, 2003.


Web Search

 • Why
    • Today, scientists can easily post their research findings on the Web
      or compare their discoveries with previous work
    • Manually maintaining a wrapper library will not scale to
      accommodate the growth of genomics data sources on the Web




Web Search

 • Who
    • --
 • What
    • An automated system able to find, classify, and wrap new sources
      without constant human intervention
 • Where
    • Distributed genomics data sources


                D. Rocco and T. Critchlow. Automatic discovery and classification of
                bioinformatics Web sources. Bioinformatics, 19(15), pp. 1927-1933,
                2003.

Web Search

 • Who
    • GoPubMed
 • What
    • An ontology-based literature search applied to Gene Ontology
      (GO) and PubMed
 • Where
    • Scientific literature


                R. Delfs, A. Doms, A. Kozlenkov, and M. Schroeder. GoPubMed:
                ontology-based literature search applied to gene ontology and PubMed.
                In Proceedings of German Bioinformatics Conference, pp. 169–178,
                2004.

Information Filtering

 • Why
    • In the Web 2.0 scenario, users look for collaborative environments
      in which they can meet other users with similar preferences and
      needs
    • Researchers need to search for and/or generate specialized datasets
      that meet specific requirements




Information Filtering

 • Who
    • ProDaMa-C – Protein Dataset Management – Collaborative
 • What
    • A web application aimed at
       • Generating specialized protein structure datasets
       • Favouring the collaboration among researchers
 • Where
    • Protein structures


                G. Armano and A. Manconi. A Collaborative Web Application for
                Supporting Researchers in the Task of Generating Protein Datasets.
                Advances in Distributed Agent-based Retrieval Tools, V. Pallotta, A.
                Soro, E. Vargiu (eds.), Springer-Verlag, 2011.

Information Filtering

 • Who
    • Gene Recommender
 • What
    • An algorithm that ranks genes according to how strongly they
      correlate with a set of query genes
 • Where
    • Analysis of gene expression


                A.B. Owen, J. Stuart, K. Mach, A.M. Villeneuve, and S. Kim. A gene
                recommender algorithm to identify coexpressed genes. Genome
                Research, 13(8), pp. 1828-1837, 2003.

Text Mining

 • Why
    • Web-based tools capable of filtering public DBs are increasingly
      required
    • Interesting and useful information, relevant to the researcher, could
      appear in documents (e.g., papers) they have not read and therefore
      be missed entirely
    • Of paramount importance to DB search methods is a reliable
      means of distinguishing true hits from false hits
    • Biologists construct a pathway by reading a large number of
      articles and interpreting them as a consistent network, but the link to
      the original article is lost


Text Mining

 • Who
    • MedMiner
 • What
    • An Internet text mining tool that filters the literature and presents
      the most relevant portions in a well-organized way that facilitates
      understanding
 • Where
    • Gene expression profiling

                L. Tanabe, U. Scherf, L.H. Smith, J.K. Lee, L. Hunter, and J.N.
                Weinstein. MedMiner: an Internet Text-Mining Tool for Biomedical
                Information, with Application to Gene Expression Profiling.
                Biotechniques, 27(6), pp. 1210-1214, 1999.

Text Mining

 • Who
    • BioRAT
 • What
    • A research assistant that, given a query,
       • autonomously finds a set of papers
       • reads them
       • highlights the most relevant facts in each
 • Where
    • Scientific literature

                D.P.A. Corney, B.F. Buxton, W.B. Langdon, and D.T. Jones.
                BioRAT: Extracting biological information from full-length papers.
                Bioinformatics, 20(17), pp. 3206–3213, 2004.

Text Mining

 • Who
    • SAWTED – Structure Assignment With Text Description
 • What
    • An automated system for filtering DB hits
 • Where
    • Homologues annotation


                R.M. MacCallum, L.A. Kelley, and M.J. Sternberg. SAWTED: structure
                assignment with text description-enhanced detection of remote
                homologues with automated SWISS-PROT annotation comparisons.
                Bioinformatics, 16(2), pp. 125-129, 2000.

Text Mining

 • Who
    • PathText
 • What
    • A system to integrate a pathway visualizer, text mining systems
      and annotation tools into a seamless environment
 • Where
    • Pathway visualizations


                B. Kemper, T. Matsuzaki, Y. Matsuoka, Y. Tsuruoka, H. Kitano, S.
                Ananiadou, and J. Tsujii. PathText: a text mining integrator for
                biological pathway visualizations. Bioinformatics, 26(12), pp.
                i374–i381, 2010.

Text Categorization

 • Why
    • Information in text form, such as MEDLINE records, is a greatly
      underutilized source of biological information
    • Individual researchers find it difficult to keep up with all the new,
      relevant information
    • Systems that extract structured information from natural language
      passages have been highly successful in specialized domains
    • The time is ripe for developing such applications for molecular biology
      and genomics




Text Categorization

 • Who
    • --
 • What
    • Constructing biological knowledge bases by extracting information
      from text sources
 • Where
    • MEDLINE


                M. Craven and J. Kumlien. Constructing Biological Knowledge Bases
                by Extracting Information from Text Sources. In Proceedings of the 7th
                International Conference on Intelligent Systems for Molecular Biology,
                1999.

Text Categorization

 • Who
    • Genies
 • What
    • A natural-language processing system for the extraction of
      molecular pathways
 • Where
    • Scientific publications


                C. Friedman, P. Kra, H. Yu, M. Krauthammer, and A. Rzhetsky. Genies:
                a natural-language processing system for the extraction of molecular
                pathways from journal articles. Bioinformatics, 17, pp. 574–582, 2001.

Hierarchical Text Categorization

 • Why
    • A great deal of genomics information accumulated over the years is
      available in online text repositories (such as MEDLINE)
    • These resources still do not provide adequate mechanisms for
      retrieving the required information
    • Traditional filtering techniques based on keyword search are often
      inadequate to express what the user is really searching for
    • Web repositories, such as Medical Subject Headings (MeSH) in
      MEDLINE, encompass an underlying taxonomy




Hierarchical Text Categorization

 • Who
    • --
 • What
    • A tool for assisting biologists with literature search for the task of
      associating genes with Gene Ontology codes
 • Where
    • MEDLINE


                S. Kiritchenko, S. Matwin, and A. F. Famili. Hierarchical text
                categorization as a tool of associating genes with gene ontology codes.
                In 2nd European Workshop on Data Mining and Text Mining for
                Bioinformatics, pp. 26–30, 2004.

Hierarchical Text Categorization

 • Who
    • Pub.MAS
 • What
    • A multiagent system for retrieving and classifying publications
 • Where
    • BMC Bioinformatics
    • PubMed Central


                G. Armano, A. Manconi, and E. Vargiu. A MultiAgent System for
                Retrieving Bioinformatics Publications from Web Sources. IEEE
                Transactions on Nanobioscience, Special Session on GRID, Web
                Services, Software Agents and Ontology Applications for Life Science,
                6(2), pp. 104-109, 2007.

Case Study: Retrieving and Filtering Bioinformatics Publications




An IR Task

 [Figure: the IR task. Labels shown: Online Repositories; Information Extraction (Wrapping Information Sources); Extracted Data/Information; Text Categorization (Taxonomic Classification of Items); Selected Data/Information; User's Feedback (Adaptive Behavior)]




Information Extraction

 • Essential to retrieve documents provided by heterogeneous and
   distributed sources


                A.H.F. Laender, B.A. Ribeiro-Neto, A.S. da Silva, and J.S. Teixeira.
                A brief survey of web data extraction tools. SIGMOD Rec., 31(2), pp.
                84–93, 2002.

Text Categorization

 • It is the task of determining and assigning topical labels to
   content
 • Typical approaches to text categorization
    • Statistical
    • Semantic
 • In recent years, several researchers have investigated the use of
   hierarchies for text categorization


                F. Sebastiani. A tutorial on automated text categorisation. Proceedings
                of ASAI-99, 1st Argentinian Symposium on Artificial Intelligence, pp.
                7-35, 1999.

Users' Feedback

 • It is aimed at dealing with any feedback provided by the user
 • In semiautomated classification and adaptive filtering we may
   expect the user of a classifier to provide feedback on how test
   documents have been classified
 • In this case further training may be performed during the
   operating phase




Hierarchical Text Categorization

 Hierarchical Text Categorization (HTC) deals with problems
 where categories are organized in the form of a hierarchy.


                D. Koller and M. Sahami. Hierarchically classifying documents using very
                few words. Proceedings of 14th International Conference on Machine
                Learning, pp. 170–178, 1997.

HTC at a Glance

 • HTC studies how to improve the performance of
   classical text categorization techniques by exploiting the
   knowledge of the taxonomic relationships among classes




Motivations

 • People organize large collections of documents in hierarchies of
   topics, or arrange a large body of knowledge in ontologies
 • The main goal of automatic text categorization is to deal with
   underlying taxonomies
 • A hierarchical approach can give benefits in real-world scenarios,
   characterized by information overload and imbalanced data




HTC Approaches

 • Pachinko machine
    • At each level of the hierarchy
       • The classifier selects the single most probable category
       • It goes down the hierarchy inspecting only the children of the selected
         node
 • Probabilistic hierarchical local approach
    • At each level of the hierarchy
       • The classifier makes probabilistic decisions
       • It selects the leaf categories on the most probable paths


                S. Kiritchenko. Hierarchical text categorization and its application to
                bioinformatics. Ph.D. Thesis, University of Ottawa, Canada, 2006.
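
The Pachinko-machine descent can be illustrated with a minimal sketch. The taxonomy, the keyword lists, and the `keyword_score` function below are hypothetical stand-ins for a real taxonomy and trained classifiers; only the greedy top-down selection reflects the approach named on the slide.

```python
from typing import Callable, Dict, List

# Hypothetical toy taxonomy: every node maps to the list of its children.
TAXONOMY: Dict[str, List[str]] = {
    "root": ["biology", "computing"],
    "biology": ["genomics", "proteomics"],
    "computing": ["databases", "algorithms"],
    "genomics": [],
    "proteomics": [],
    "databases": [],
    "algorithms": [],
}

# Hypothetical keyword lists that stand in for trained classifiers.
KEYWORDS: Dict[str, List[str]] = {
    "biology": ["gene", "protein", "cell"],
    "computing": ["index", "query", "algorithm"],
    "genomics": ["gene", "genome"],
    "proteomics": ["protein", "proteome"],
    "databases": ["index", "query"],
    "algorithms": ["algorithm", "complexity"],
}


def keyword_score(category: str, text: str) -> float:
    """Count how many times the category keywords occur in the text."""
    words = text.lower().split()
    return float(sum(words.count(keyword) for keyword in KEYWORDS[category]))


def pachinko(text: str,
             taxonomy: Dict[str, List[str]],
             score: Callable[[str, str], float],
             root: str = "root") -> List[str]:
    """Follow the best-scoring child at each level until a leaf is reached."""
    path: List[str] = []
    node = root
    while taxonomy[node]:
        node = max(taxonomy[node], key=lambda child: score(child, text))
        path.append(node)
    return path


result = pachinko("protein folding in the cell proteome", TAXONOMY, keyword_score)
```

Because only the children of the selected node are inspected, an error made near the root cannot be recovered further down; the probabilistic local approach avoids this by keeping several paths alive.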

HTC Approaches

 • Local classifier per node
    • Each classifier decides whether to forward the document to its children
 • Local classifier per parent node
    • Each classifier decides to which subtree(s) the document should be
      sent
 • Local classifier per level
    • The number of outputs per level grows while going down through
      the taxonomy
 • Global classifier
    • One classifier is trained, able to discriminate among all categories

                C.J. Silla and A. Freitas. A survey on hierarchical classification across
                different application domains. Journal of Data Mining and Knowledge
                Discovery, 2(1-2), pp. 31-72, 2010.

Progressive Filtering

 • Progressive Filtering (PF) is a simple categorization technique
   that operates on hierarchically structured categories
 • A way to implement PF consists of decomposing a given rooted
   taxonomy into pipelines, one for each path that exists between
   the root and each node of the taxonomy
 • Each node is a binary classifier able to recognize whether or not
   an input belongs to the corresponding class
 • A threshold selection algorithm (TSA) can be run to identify an
   optimal, or sub-optimal, combination of thresholds for each
   pipeline

                A. Addis, G. Armano, and E. Vargiu. Assessing Progressive Filtering to
                Perform Hierarchical Text Categorization in Presence of Input
                Imbalance. Proceedings of International Conference on Knowledge
                Discovery and Information Retrieval, pp. 14-23, 2010.
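
The decomposition of a rooted taxonomy into pipelines can be sketched as follows. The toy taxonomy and the function name `pipelines` are hypothetical, and leaving the root out of each pipeline (treating it as an entry point without a classifier) is an assumption made here, not something the slides state.

```python
from typing import Dict, List


def pipelines(taxonomy: Dict[str, List[str]], root: str) -> List[List[str]]:
    """Return one pipeline (path from below the root) for every non-root node."""
    result: List[List[str]] = []

    def visit(node: str, path: List[str]) -> None:
        for child in taxonomy.get(node, []):
            child_path = path + [child]
            result.append(child_path)   # one pipeline ends at every node
            visit(child, child_path)    # deeper nodes extend the same path

    visit(root, [])
    return result


# Hypothetical taxonomy: root -> A, B and A -> A1, A2.
toy_taxonomy = {"root": ["A", "B"], "A": ["A1", "A2"]}
toy_pipelines = pipelines(toy_taxonomy, "root")
```

For this toy taxonomy the result contains four pipelines: `[A]`, `[A, A1]`, `[A, A2]`, and `[B]`.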

PF at a Glance

 • Starting from the root, each input traverses the taxonomy as a
   “token”

Classifiers in PF

 • Partitioning the taxonomy in pipelines gives rise to a set of new
   classifiers, each represented by a pipeline






Classifiers in PF

 • The same classifier may have different behaviours, depending on
   the pipeline in which it is embedded
 • Each pipeline can be considered in isolation from the others
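
A minimal sketch of a pipeline acting as a classifier in its own right is given below. The keyword-based scorers, the threshold values, and the rule that a document is accepted only when every stage reaches its threshold are assumptions introduced for illustration.

```python
from typing import Callable, Sequence, Set, Tuple

Classifier = Callable[[str], float]


def keyword_classifier(keywords: Set[str]) -> Classifier:
    """Hypothetical scorer: fraction of the keywords found in the document."""
    def score(document: str) -> float:
        words = set(document.lower().split())
        return len(keywords & words) / len(keywords)
    return score


def accepts(pipeline: Sequence[Tuple[Classifier, float]], document: str) -> bool:
    """A pipeline accepts a document only if every stage reaches its threshold."""
    return all(classifier(document) >= threshold
               for classifier, threshold in pipeline)


biology = keyword_classifier({"gene", "protein"})
structure = keyword_classifier({"structure", "fold"})

# The same `biology` classifier is embedded in two pipelines with different
# thresholds, so it may behave differently in each of them.
pipeline_a = [(biology, 0.5)]
pipeline_b = [(biology, 0.4), (structure, 0.6)]

document = "protein structure prediction"
accepted_a = accepts(pipeline_a, document)
accepted_b = accepts(pipeline_b, document)
```

Since each pipeline carries its own threshold vector, it can be tuned and evaluated without reference to the other pipelines.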

Threshold Selection in PF

 • A relevant problem is how to calibrate the threshold of the
   binary classifiers embedded by each pipeline in order to
   optimize the pipeline behaviour
 • Searching for an optimal or sub-optimal combination of
   thresholds in a pipeline can actually be viewed as the problem of
   finding a maximum of a utility function F that depends on the
   corresponding threshold vector θ
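
In symbols, the threshold selection problem for one pipeline can be written as below. The pipeline length n, the starred notation, and the assumption that every threshold lies in [0, 1] are introduced here for illustration and do not come from the slides.

```latex
\theta^{*} = \arg\max_{\theta \in [0,1]^{n}} F(\theta),
\qquad \theta = (\theta_1, \dots, \theta_n)
```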




TSA

 • For each pipeline, the best combination of thresholds is
   calculated according to a bottom-up algorithm that uses two
   functions
    • Repair, which increases/decreases (↑ / ↓) the threshold until the
      utility function reaches a maximum
    • Calibrate, which recursively operates downward from the given
      classifier by repeatedly calling repair (↑ / ↓)


                A. Addis, G. Armano, and E. Vargiu. A comparative experimental
                assessment of a threshold selection algorithm in hierarchical text
                categorization. In: Advances in Information Retrieval. The 33rd
                European Conference on Information Retrieval (ECIR 2011), 2011.
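
The interplay between the two functions can be sketched as a simple greedy search. This is a simplified illustration written from the description above, not the authors' published algorithm: the step size, the starting value, the stopping rule, and the toy utility function are all assumptions. In practice the utility would be a performance measure computed on validation data.

```python
from typing import Callable, List

Utility = Callable[[List[float]], float]


def repair(thresholds: List[float], i: int, utility: Utility,
           step: float, direction: int) -> None:
    """Move thresholds[i] in one direction while the utility keeps improving."""
    best = utility(thresholds)
    while True:
        candidate = thresholds[i] + direction * step
        if not 0.0 <= candidate <= 1.0:
            break
        trial = thresholds[:i] + [candidate] + thresholds[i + 1:]
        value = utility(trial)
        if value <= best:
            break
        thresholds[i] = candidate
        best = value


def calibrate(thresholds: List[float], i: int, utility: Utility,
              step: float) -> None:
    """Repair classifier i upward and downward, then recurse on those below it."""
    repair(thresholds, i, utility, step, +1)
    repair(thresholds, i, utility, step, -1)
    if i + 1 < len(thresholds):
        calibrate(thresholds, i + 1, utility, step)


def select_thresholds(n: int, utility: Utility,
                      step: float = 0.25, start: float = 0.0) -> List[float]:
    """Bottom-up pass: start from the last classifier of the pipeline."""
    thresholds = [start] * n
    for i in range(n - 1, -1, -1):
        calibrate(thresholds, i, utility, step)
    return thresholds


def toy_utility(thresholds: List[float]) -> float:
    """Hypothetical utility that peaks at thresholds (0.5, 0.25)."""
    targets = [0.5, 0.25]
    return -sum((t - g) ** 2 for t, g in zip(thresholds, targets))


selected = select_thresholds(2, toy_utility)
```

A greedy search of this kind may stop at a local maximum, which is consistent with the slides' remark that the resulting combination of thresholds can be sub-optimal.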


TSA: An Example




The Prototype

 • MultiAgent Architecture
    • X.MAS
 • Agent Framework
    • JADE


                A. Addis, G. Armano, and E. Vargiu. From a Generic Multiagent
                Architecture to Multiagent Information Retrieval Systems. In: AT2AI-6,
                Sixth International Workshop, From Agent Theory to Agent
                Implementation, pp. 3–9, 2008.

                F. Bellifemine, G. Caire, and D. Greenwood. Developing Multi-Agent
                Systems with JADE (Wiley Series in Agent Technology). John Wiley
                and Sons, 2007.

X.MAS at a Glance

 • Macro-architecture




X.MAS at a Glance

 • Micro-architecture

 [Figure: agent layers, from top to bottom: Information Agent (Scheduler, Source); Middle Agent (Scheduler, Dispatcher); Filter Agent (Scheduler, Actuator); Middle Agent (Scheduler, Dispatcher); Task Agent (Scheduler, Actuator); Middle Agent (Scheduler, Dispatcher); Interface Agent (Scheduler)]






Pub.MAS


                G. Armano, A. Manconi, and E. Vargiu. A MultiAgent System for
                Retrieving Bioinformatics Publications from Web Sources. IEEE
                Transactions on Nanobioscience, Special Session on GRID, Web
                Services, Software Agents and Ontology Applications for Life Science,
                6(2), pp. 104-109, 2007.

Information Extraction

 • It is supported by a set of agents explicitly devoted to
    • wrapping the selected information sources
    • encoding the extracted documents
 • An information agent wraps the BMC Bioinformatics web site
    • HTML wrapper
 • An information agent wraps the PubMed Central digital archive
    • Web service wrapper
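
What an HTML wrapper does can be sketched with the Python standard library. The markup below, including the `article-title` class, is hypothetical and does not reproduce the actual structure of the BMC Bioinformatics site; the prototype itself is built on JADE, so this only illustrates the idea.

```python
from html.parser import HTMLParser
from typing import List, Optional, Tuple


class ArticleLinkWrapper(HTMLParser):
    """Collect (title, link) pairs from anchors marked as article titles."""

    def __init__(self) -> None:
        super().__init__()
        self.articles: List[Tuple[str, str]] = []
        self._href: Optional[str] = None   # set while inside a matching anchor
        self._text: List[str] = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "a" and attributes.get("class") == "article-title":
            self._href = attributes.get("href") or ""
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.articles.append(("".join(self._text).strip(), self._href))
            self._href = None


def wrap(html: str) -> List[Tuple[str, str]]:
    """Extract the article entries from one (hypothetical) listing page."""
    parser = ArticleLinkWrapper()
    parser.feed(html)
    parser.close()
    return parser.articles


SAMPLE_PAGE = (
    '<ul>'
    '<li><a class="article-title" href="/articles/1">A multiagent system</a></li>'
    '<li><a href="/about">About</a></li>'
    '</ul>'
)
extracted = wrap(SAMPLE_PAGE)
```

A wrapper of this kind is tied to the page layout and has to be updated when the layout changes, whereas a web service wrapper relies on a documented interface.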




Hierarchical Text Categorization

 • The PF approach previously described has been implemented
 • Documents have been encoded to
    • remove all non-informative words
    • remove the most common morphological and inflexional suffixes
    • select the relevant features
    • generate a feature vector for each document
 • Classification is performed by wkNN classifiers
    • the score is assigned using non-parametric density estimation of the
      “a posteriori” probability



The Adopted Taxonomy


                P.G. Baker, C.A. Goble, S. Bechhofer, N.W. Paton, R. Stevens, and A.
                Brass. An ontology for bioinformatics applications. Bioinformatics,
                15(6), pp. 510–520, 1999.
Users' Feedback

 • The system takes into account any feedback provided by the user
 • Two solutions have been tested
      • training an ANN
      • using a kNN classifier
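
One way the kNN option could work is sketched below: feedback is stored as labelled feature vectors, and a new document is shown only if most of its nearest rated neighbours were judged relevant. The class name, the Euclidean distance, the majority vote, and the default of accepting documents when no feedback exists are assumptions for illustration; the slides do not describe the actual design.

```python
import math


class FeedbackFilter:
    """Stores user judgements and filters new documents by kNN majority vote."""

    def __init__(self, k=3):
        self.k = k
        self.examples = []  # list of (feature_vector, relevant) pairs

    def add_feedback(self, vector, relevant):
        """Record that the user marked this document relevant or not."""
        self.examples.append((list(vector), bool(relevant)))

    def accepts(self, vector):
        """Return True if the k nearest rated documents are mostly relevant."""
        if not self.examples:
            return True  # no feedback yet: let everything through (assumed)
        nearest = sorted(self.examples,
                         key=lambda ex: math.dist(vector, ex[0]))[: self.k]
        votes = sum(1 for _, relevant in nearest if relevant)
        return votes * 2 > len(nearest)


feedback = FeedbackFilter(k=3)
feedback.add_feedback([1.0, 0.0], True)
feedback.add_feedback([0.9, 0.1], True)
feedback.add_feedback([0.0, 1.0], False)
feedback.add_feedback([0.1, 0.9], False)

print(feedback.accepts([0.8, 0.2]))  # near the documents rated relevant
print(feedback.accepts([0.2, 0.8]))  # near the documents rated not relevant
```

Unlike a trained ANN, this filter needs no retraining step: each new judgement takes effect as soon as it is stored.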




Experiments

 • Different kinds of tests have been performed, each aimed at
    highlighting a specific issue
      • we estimated the (normalized) confusion matrix for each classifier
        belonging to the highest level of the taxonomy
      • we studied the impact of using pipelines of classifiers, also
        trying to assess whether any residual independence was in
        fact present
      • we assessed the kNN-based solution devised for implementing
        users' feedback
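
For the first kind of test, a row-normalized confusion matrix of a binary classifier can be computed as in the sketch below. The function name and the toy labels are assumptions; normalizing by row, so that each non-empty row sums to 1, is one common convention and may differ from the one used in the experiments.

```python
def normalized_confusion_matrix(actual, predicted):
    """Return [[TN, FP], [FN, TP]] with each row divided by its row total."""
    counts = [[0, 0], [0, 0]]
    for truth, guess in zip(actual, predicted):
        counts[int(truth)][int(guess)] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix


# Toy labels: True means the document belongs to the category.
actual = [True, True, True, True, False, False, False, False, False, False]
predicted = [True, True, True, False, False, False, False, False, True, True]

for row in normalized_confusion_matrix(actual, predicted):
    print(row)
```

Row normalization makes classifiers comparable even when their categories have very different numbers of test documents.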




Experiments

 • Tests have been performed using selected publications extracted
   from the BMC Bioinformatics site and from the PubMed Central
   digital archive
 • Publications have been classified by a domain expert
   according to the proposed taxonomy
 • For each item of the taxonomy, a set of about 100-150 articles
   has been selected to train the corresponding wkNN classifier,
   and 300-400 articles have been used to test it




Conclusions

 • Bioinformatics needs suitable, automated, and “intelligent”
   solutions to acquire, analyse, organize, and store biological data
 • IR might be very useful in tackling bioinformatics problems
 • Currently, few IR techniques have been adopted to solve
   bioinformatics tasks
 • A system aimed at retrieving and filtering bioinformatics
   publications has been presented as a case study
 • We argue that further investigations and experiments could be
   carried out to exploit IR in bioinformatics



Acknowledgments

 • This work was partially supported by the Italian Ministry of
   Education – Investment funds for basic research, under the
   project ITALBIONET – Italian Network of Bioinformatics
 • I wish to thank all the IASC Group members for their valuable
   help
 • IASC Group members are:
      • G. Armano – head
      • A. Addis, F. Mascia and E. Vargiu – PhD, Post Doc
      • A. Giuliani, N. Hatami, M. Javarone and F. Ledda – PhD students
      • S. Curatti – collaborator, programmer
 • I also wish to thank Andrea Manconi for his suggestions

Thanks for your attention!
Contact: Eloisa Vargiu vargiu@diee.unica.it

Β 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
Β 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
Β 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
Β 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Β 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
Β 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
Β 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
Β 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
Β 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
Β 

Bioinformatics Meets Information Retrieval

  • 1. Bioinformatics Meets Information Retrieval: State of the Art and a Case Study. Eloisa Vargiu, Intelligent Agents and Soft-Computing Group, Dept. of Electrical and Electronic Engineering, University of Cagliari, Italy. February 16, 2011 – Valencia (Spain). Email: vargiu@diee.unica.it
  • 2. My Background. 2000 – 2004: Automatic planning (classic domains: HW[]; dynamic domains: HIPE). 2000 - …: Multiagent systems (a Personalized Adaptive and Cooperative Multiagent System: PACMAS; a generic architecture to perform information retrieval tasks: X.MAS). 2004 – 2009: Bioinformatics (protein secondary structure prediction: MASSP3 and GAME/SSP). 2006 - …: Information Retrieval (hierarchical text categorization: PF and TSA; recommender systems and contextual advertising: ConCA).
  • 3. Outline. Context and Mission; Why Bioinformatics Needs Information Retrieval; Bioinformatics Meets Information Retrieval; Case Study: Retrieving and Filtering Bioinformatics Publications; Conclusions.
  • 4. Context and Mission
  • 5. Web Evolution. Web 1.0 (1993): source of information, personal homepages. Web 2.0 (2004): social networks, (micro)blogging. Web 3.0 (2005): semantic web, web composition.
  • 6. Web Evolution and Bioinformatics. A long time ago: data was stored in local DBs; data was shared as flat files; biologists worked alone or in small groups.
  • 7. Web Evolution and Bioinformatics. Today: online repositories and web services. Online repositories: the major sources of nucleotide sequences are the ones belonging to the International Nucleotide Sequence Database Collaboration, namely DDBJ (DNA DataBank of Japan), EMBL (European Molecular Biology Laboratory), and GenBank (NIH genetic sequence database). Web services: basic bioinformatics services are classified by the EBI into three categories, namely SSS (Sequence Search Services), MSA (Multiple Sequence Alignment), and BSA (Biological Sequence Analysis).
  • 8. Web Evolution and Scientific Publications. A long time ago: publications were consulted at the library; just two or three relevant journals were available; relevant publications were selected manually.
  • 9. Web Evolution and Scientific Publications. Today: online journals; online conference proceedings; publications are often available for free; manual selection of relevant publications becomes unfeasible.
  • 10. As a Consequence. Unstructured information; information overload; personalized information selection and input imbalance.
  • 11. Our Mission. To cope with: unstructured information, by classifying documents according to a given taxonomy; information overload, by filtering information to reduce redundancy; personalized information selection and input imbalance, by filtering information according to user preferences. Case study: retrieving and filtering bioinformatics publications.
  • 12. Research Topics. Information Retrieval; Bioinformatics.
  • 13. Information Retrieval. Information Retrieval (IR) deals with the representation, storage, organization of, and access to information items. The user must first translate this information need into a query which can be processed by an IR system. Given the user query, the key goal of an IR system is to retrieve information which might be useful or relevant to the user. R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. New York: Addison-Wesley, 1999.
  • 14. Main IR Topics. Indexing; search and web search; information filtering; text mining; text categorization and hierarchical text categorization.
  • 15. Bioinformatics. Bioinformatics is the field of science in which biology, computer science, and information technology merge to form a single discipline. The ultimate goal of the field is to enable the discovery of new biological insights as well as to create a global perspective from which unifying principles in biology can be discerned. National Center for Biotechnology Information (NCBI), http://www.ncbi.nlm.nih.gov/.
  • 16. Main Bioinformatics Research Areas. Sequence analysis; genome annotation; computational evolutionary biology; analysis of gene expression; analysis of protein expression; analysis of mutations in cancer; comparative genomics; modelling biological systems; prediction of protein structure; molecular interaction.
  • 17. Why Bioinformatics Needs Information Retrieval
  • 18. Does Bioinformatics Need IR? Bioinformatics is concerned with researching, developing and applying tools and methods to acquire, analyse, organize and store biological and medical data. Indexing and search techniques may help in the task of acquiring data. Information filtering, text mining and text categorization techniques may be useful for the analysis of data. Text categorization, with particular reference to hierarchical text categorization, may be used in the organization and storage tasks.
  • 19. Bioinformatics Data. A huge amount of data to be: indexed; searched for in large databases or on the web; filtered according to users' preferences; text mined; categorized according to its textual content.
  • 20. DB Indexing. Why: data types are relegated to blob and unstructured text fields; there are few results in building persistent access paths to support fast retrieval methods; genomic datasets in public repositories are annotated with free-text fields describing the pathological state of the studied sample; annotations are not mapped to concepts in any ontology.
  • 21. DB Indexing. Who: MoBIoS (Molecular Biological Information System). What: a specialized database management system; the storage manager is based on metric-space indexing; the query language entails biological data types. Where: sequence homology (local alignment and mutations). D. Miranker, W. Xu, and R. Mao. MoBIoS: a Metric-Space DBMS to Support Biological Discovery. Proceedings of the International Conference on Scientific and Statistical Database Management Systems, 2003.
  • 22. DB Indexing. Who: --. What: ontology-driven indexing of public datasets for translational bioinformatics; methods to map text annotations of gene expression datasets to concepts in the UMLS. Where: Gene Expression Omnibus; Stanford Tissue Microarray Database. N.H. Shah, C. Jonquet, A.P. Chiang, A.J. Butte, R. Chen, and M.A. Musen. Ontology-driven indexing of public datasets for translational bioinformatics. BMC Bioinformatics, 10(Suppl 2):S1, 2009.
  • 23. Web Indexing. Why: most often sequence retrieval tools and sequence analysis tools are separated; the usage of sequence DBs is often general and limited to keyword searching and entry retrieval; discovering and accessing the appropriate bioinformatics resource for a specific task has become increasingly important.
  • 24. Web Indexing. Who: SIRW (a Web Server for the Simple Indexing and Retrieval System). What: a WWW interface to the Simple Indexing and Retrieval (SIR) system to parse and index flat file DBs; a framework for doing sequence analysis for selected biological sequences. Where: sequence analysis (motif pattern searches). C. Ramu. SIRW: a web server for the Simple Indexing and Retrieval System that combines sequence motif searches with keyword searches. Nucleic Acids Research, 31(13), pp. 3771-3774, 2003.
  • 25. Web Indexing. Who: BIRI (BIoinformatics Resource Inventory). What: an approach for automatically discovering and indexing public bioinformatics resources. Where: the scientific literature. G. de la Calle, M. García-Remesal, S. Chiesa, D. de la Iglesia, V. Maojo. BIRI: a new approach for automatically discovering and indexing available public bioinformatics resources from the literature. BMC Bioinformatics, Oct 7;10:320, 2009.
  • 26. DB Search. Why: a wealth of bioinformatics tools and databases has been created over the last decade and most are freely available; often it is desired to visualize the database hits stacked according to the query sequence; there is no inventory presenting an up-to-date and easily searchable index of all these resources.
  • 27. DB Search. Who: MView (Multiple alignment Viewer). What: a tool for converting the result of a sequence database search into the form of a coloured multiple alignment of hits stacked against the query. Where: multiple alignment. N.P. Brown, C. Leroy, and C. Sander. MView: a web-compatible database search or multiple alignment viewer. Bioinformatics, 14(4), pp. 380-381, 1998.
  • 28. DB Search. Who: BioWareDB. What: an extensive and current catalog of software and DBs of relevance to researchers in the field of biology and medicine. Where: current and available biomedical computing resources. M.W. Matthiessen. BioWareDB: the biomedical software and database search engine. Bioinformatics, 19(17), pp. 2319-2320, 2003.
  • 29. Web Search. Why: today, scientists can easily post their research findings on the Web or compare their discoveries with previous work; manually maintaining a wrapper library will not scale to accommodate the growth of genomics data sources on the Web.
  • 30. Web Search. Who: --. What: an automated system able to find, classify, and wrap new sources without constant human intervention. Where: distributed genomics data sources. D. Rocco and T. Critchlow. Automatic discovery and classification of bioinformatics Web sources. Bioinformatics, 19(15), pp. 1927-1933, 2003.
  • 31. Web Search. Who: GoPubMed. What: an ontology-based literature search applied to Gene Ontology (GO) and PubMed. Where: scientific literature. R. Delfs, A. Doms, A. Kozlenkov, and M. Schroeder. GoPubMed: ontology-based literature search applied to gene ontology and PubMed. In Proceedings of German Bioinformatics Conference, pp. 169–178, 2004.
  • 32. Information Filtering. Why: in the Web 2.0 scenario, users look for collaborative environments in which they can meet other users with similar preferences and needs; researchers need to search for and/or generate specialized datasets that meet specific requirements.
  • 33. Information Filtering. Who: ProDaMa-C (Protein Dataset Management – Collaborative). What: a web application aimed at generating specialized protein structure datasets and at favouring collaboration among researchers. Where: protein structures. G. Armano and A. Manconi. A Collaborative Web Application for Supporting Researchers in the Task of Generating Protein Datasets. Advances in Distributed Agent-based Retrieval Tools, V. Pallotta, A. Soro, E. Vargiu (eds.), Springer-Verlag, 2011.
  • 34. Information Filtering. Who: Gene Recommender. What: an algorithm that ranks genes according to how strongly they correlate with a set of query genes. Where: analysis of gene expression. A.B. Owen, J. Stuart, K. Mach, A.M. Villeneuve, S. Kim. A gene recommender algorithm to identify coexpressed genes. Genome Research, Aug;13(8), pp. 1828-37, 2003.
  • 35. Text Mining. Why: web-based tools capable of filtering public DBs are increasingly required; interesting and useful information, relevant to researchers, could appear in documents (e.g., papers) they have not read and therefore be missed entirely; of paramount importance to DB search methods is a reliable means of distinguishing true hits from false hits; biologists construct a pathway by reading a large number of articles and interpreting them as a consistent network, but the link to the original article is lost.
  • 36. Text Mining. Who: MedMiner. What: an Internet text mining tool that filters the literature and presents the most relevant portions in a well-organized way that facilitates understanding. Where: gene expression profiling. L. Tanabe, U. Scherf, L.H. Smith, J.K. Lee, L. Hunter, and J.N. Weinstein. MedMiner: an Internet Text-Mining Tool for Biomedical Information, with Application to Gene Expression Profiling. Biotechniques, Dec;27(6), pp. 1210-4, 1999.
  • 37. Text Mining. Who: BioRAT. What: a research assistant that, given a query, autonomously finds a set of papers, reads them, and highlights the most relevant facts in each. Where: scientific literature. D. P. A. Corney, B. F. Buxton, W. B. Langdon, and D. T. Jones. BioRAT: Extracting biological information from full-length papers. Bioinformatics, 20(17), pp. 3206–3213, 2004.
  • 38. Text Mining. Who: SAWTED (Structure Assignment With Text Description). What: an automated system for filtering DB hits. Where: homologues annotation. R.M. MacCallum, L.A. Kelley, and M.J. Sternberg. SAWTED: structure assignment with text description-enhanced detection of remote homologues with automated SWISS-PROT annotation comparisons. Bioinformatics, Feb;16(2), pp. 125-9, 2000.
  • 39. Text Mining. Who: PathText. What: a system to integrate a pathway visualizer, text mining systems and annotation tools into a seamless environment. Where: pathway visualizations. B. Kemper, T. Matsuzaki, Y. Matsuoka, Y. Tsuruoka, H. Kitano, S. Ananiadou, and J. Tsujii. PathText: a text mining integrator for biological pathway visualizations. Bioinformatics, 26(12), pp. i374-i381, 2010.
  • 40. Text Categorization. Why: information in text form, such as MEDLINE records, is a greatly underutilized source of biological information; individual researchers find it difficult to keep up with all the new, relevant information; systems that extract structured information from natural language passages have been highly successful in specialized domains; the time is ripe for developing such applications for molecular biology and genomics.
  • 41. Text Categorization. Who: --. What: constructing biological knowledge bases by extracting information from text sources. Where: MEDLINE. M. Craven and J. Kumlien. Constructing Biological Knowledge Bases by Extracting Information from Text Sources. In Proceedings of the 7th International Conference on Intelligent Systems for Molecular Biology, 1999.
  • 42. Text Categorization. Who: Genies. What: a natural-language processing system for the extraction of molecular pathways. Where: scientific publications. C. Friedman, P. Kra, H. Yu, M. Krauthammer, and A. Rzhetsky. Genies: a natural-language processing system for the extraction of molecular pathways from journal articles. Bioinformatics, 17, pp. 574–582, 2001.
  • 43. Hierarchical Text Categorization. Why: a great deal of genomics information accumulated over the years is available in online text repositories (such as MEDLINE); these resources still do not provide adequate mechanisms for retrieving the required information; traditional filtering techniques based on keyword search are often inadequate to express what the user is really searching for; web repositories, such as Medical Subject Headings (MeSH) in MEDLINE, encompass an underlying taxonomy.
  • 44. Hierarchical Text Categorization. Who: --. What: a tool for assisting biologists with literature search for the task of associating genes with Gene Ontology codes. Where: MEDLINE. S. Kiritchenko, S. Matwin, and A. F. Famili. Hierarchical text categorization as a tool of associating genes with gene ontology codes. In 2nd European Workshop on Data Mining and Text Mining for Bioinformatics, pp. 26–30, 2004.
  • 45. Hierarchical Text Categorization. Who: Pub.MAS. What: a multiagent system for retrieving and classifying publications. Where: BMC Bioinformatics; PubMed Central. G. Armano, A. Manconi, and E. Vargiu. A MultiAgent System for Retrieving Bioinformatics Publications from Web Sources. IEEE Transactions on Nanobioscience, Special Session on GRID, Web Services, Software Agents and Ontology Applications for Life Science, 6(2), pp. 104-109, 2007.
  • 46. Case Study: Retrieving and Filtering Bioinformatics Publications
  • 47. An IR Task (diagram). Information Extraction: wrapping information sources, from online repositories to extracted data/information. Text Categorization: taxonomic classification of items, yielding selected data/information. User's Feedback: adaptive behavior.
  • 48. Information Extraction. Essential to retrieve documents provided by heterogeneous and distributed sources. A.H.F. Laender, B.A. Ribeiro-Neto, A.S. da Silva, J.S. Teixeira. A brief survey of web data extraction tools. SIGMOD Rec. 31(2), pp. 84–93, 2002.
  • 49. Text Categorization. It is the task of determining and assigning topical labels to content. Typical approaches to text categorization: statistical; semantic. In recent years several researchers have investigated the use of hierarchies for text categorization. F. Sebastiani. A tutorial on automated text categorisation. Proceedings of ASAI-99, 1st Argentinian Symposium on Artificial Intelligence, pp. 7-35, 1999.
  • 50. Users' Feedback. It is aimed at dealing with any feedback provided by the user. In semiautomated classification and adaptive filtering we may expect the user of a classifier to provide feedback on how test documents have been classified. In this case further training may be performed during the operating phase.
  • 51. Hierarchical Text Categorization. Hierarchical Text Categorization (HTC) deals with problems where categories are organized in the form of a hierarchy. D. Koller, M. Sahami. Hierarchically classifying documents using very few words. Proceedings of 14th International Conference on Machine Learning, pp. 170–178, 1997.
  • 52. HTC at a Glance. HTC studies how to improve the performance provided by classical text categorization techniques by exploiting the knowledge of the taxonomic relationships among classes.
  • 53. Motivations. People organize large collections of documents in hierarchies of topics, or arrange a large body of knowledge in ontologies. The main goal of automatic text categorization is to deal with underlying taxonomies. A hierarchical approach can give benefits in real-world scenarios, characterized by information overload and imbalanced data.
  • 54. HTC Approaches. Pachinko machine: at each level of the hierarchy, the classifier selects the single most probable category; it goes down the hierarchy inspecting only the children of the selected nodes. Probabilistic hierarchical local approach: at each level of the hierarchy, the classifier makes probabilistic decisions; it selects the leaf categories on the most probable paths. S. Kiritchenko. Hierarchical text categorization and its application to bioinformatics. Ph.D. Thesis, University of Ottawa, Canada, 2006.
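The pachinko-machine descent on slide 54 can be sketched in a few lines. This is a minimal illustration, not code from the talk: the toy taxonomy, the `KEYWORDS` table, and the keyword-counting scorer `keyword_score` are all hypothetical stand-ins for trained per-category classifiers.

```python
from typing import Callable, Dict, List

# Hypothetical toy taxonomy: each node maps to its list of children.
TAXONOMY: Dict[str, List[str]] = {
    "root": ["biology", "computing"],
    "biology": ["genomics", "proteomics"],
    "computing": ["databases", "algorithms"],
    "genomics": [], "proteomics": [], "databases": [], "algorithms": [],
}

# Hypothetical keyword lists standing in for trained classifiers.
KEYWORDS = {
    "biology": {"gene", "genome", "protein"},
    "computing": {"index", "query", "algorithm"},
    "genomics": {"gene", "genome"},
    "proteomics": {"protein"},
    "databases": {"index", "query"},
    "algorithms": {"algorithm"},
}


def keyword_score(doc: str, category: str) -> int:
    """Score a document by counting occurrences of the category's keywords."""
    words = doc.lower().split()
    return sum(words.count(k) for k in KEYWORDS[category])


def pachinko_classify(doc: str,
                      taxonomy: Dict[str, List[str]],
                      score: Callable[[str, str], float],
                      root: str = "root") -> List[str]:
    """Greedy top-down descent: at each level keep only the best-scoring child,
    then inspect only that child's children, until a leaf is reached."""
    path: List[str] = []
    node = root
    while taxonomy[node]:
        node = max(taxonomy[node], key=lambda child: score(doc, child))
        path.append(node)
    return path


print(pachinko_classify("genome annotation of a gene family", TAXONOMY, keyword_score))
```

Because only one branch is followed at each level, a wrong decision near the root cannot be recovered further down, which is the motivation for the probabilistic variant listed on the same slide.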
  • 55. HTC Approaches. Local classifier per node: each classifier decides whether to forward the document to its children. Local classifier per parent node: each classifier decides to which subtree(s) the document should be sent. Local classifier per level: the number of outputs per level grows while going down through the taxonomy. Global classifier: one classifier is trained, able to discriminate among all categories. C.J. Silla and A. Freitas. A survey on hierarchical classification across different application domains. Journal of Data Mining and Knowledge Discovery, 2(1-2), pp. 31-72, 2010.
  • 56. Progressive Filtering. Progressive Filtering (PF) is a simple categorization technique that operates on hierarchically structured categories. A way to implement PF consists of decomposing a given rooted taxonomy into pipelines, one for each path that exists between the root and each node of the taxonomy. Each node is a binary classifier able to recognize whether or not an input belongs to the corresponding class. A threshold selection algorithm (TSA) can be run to identify an optimal, or sub-optimal, combination of thresholds for each pipeline. A. Addis, G. Armano, E. Vargiu. Assessing Progressive Filtering to Perform Hierarchical Text Categorization in Presence of Input Imbalance. Proceedings of International Conference on Knowledge Discovery and Information Retrieval, pp. 14-23, 2010.
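The decomposition into pipelines described on slide 56 can be illustrated as follows. This is a sketch under stated assumptions rather than the authors' implementation: the root is treated as a virtual node without a classifier, each classifier is reduced to a precomputed score, and the names `enumerate_pipelines` and `pipeline_accepts` are hypothetical.

```python
from typing import Dict, List

# Hypothetical toy taxonomy: each node maps to its list of children.
TAXONOMY: Dict[str, List[str]] = {
    "root": ["A", "B"],
    "A": ["A1", "A2"],
    "B": [], "A1": [], "A2": [],
}


def enumerate_pipelines(taxonomy: Dict[str, List[str]], root: str) -> List[List[str]]:
    """Return one pipeline (root-to-node path, root excluded) per non-root node."""
    result: List[List[str]] = []

    def visit(node: str, prefix: List[str]) -> None:
        for child in taxonomy.get(node, []):
            path = prefix + [child]
            result.append(path)
            visit(child, path)

    visit(root, [])
    return result


def pipeline_accepts(scores: Dict[str, float],
                     pipeline: List[str],
                     thresholds: Dict[str, float]) -> bool:
    """A document passes a pipeline only if every binary classifier along the
    path accepts it, i.e. its score reaches that classifier's threshold."""
    return all(scores[node] >= thresholds[node] for node in pipeline)


# Assumed classifier outputs for one document, and a uniform 0.5 threshold.
doc_scores = {"A": 0.9, "A1": 0.7, "A2": 0.2, "B": 0.4}
thresholds = {node: 0.5 for node in doc_scores}

accepted = [p for p in enumerate_pipelines(TAXONOMY, "root")
            if pipeline_accepts(doc_scores, p, thresholds)]
print(accepted)
```

In this toy run the document is filtered out by `A2` and `B` and survives only the pipelines ending in `A` and `A1`.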
  • 57. PF at a Glance. Starting from the root, each input traverses the taxonomy as a “token”.
  • 58. Classifiers in PF. Partitioning the taxonomy into pipelines gives rise to a set of new classifiers, each represented by a pipeline.
  • 59. Classifiers in PF
  • 60. Classifiers in PF. The same classifier may have different behaviours, depending on the pipeline in which it is embedded. Each pipeline can be considered in isolation from the others.
  • 61. Threshold Selection in PF. A relevant problem is how to calibrate the thresholds of the binary classifiers embedded in each pipeline in order to optimize the pipeline behaviour. Searching for an optimal or sub-optimal combination of thresholds in a pipeline can be viewed as the problem of finding a maximum of a utility function F that depends on the corresponding threshold vector θ.
  • 62. TSA. For each pipeline the best combination of thresholds is calculated according to a bottom-up algorithm that uses two functions. Repair, which increases/decreases (↑ / ↓) the threshold until the utility function reaches a maximum. Calibrate, which recursively operates downward from the given classifier by repeatedly calling repair (↑ / ↓). A. Addis, G. Armano, E. Vargiu. A comparative experimental assessment of a threshold selection algorithm in hierarchical text categorization. In: Advances in Information Retrieval. The 33rd European Conference on Information Retrieval (ECIR 2011), 2011.
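The slides give only the outline of TSA, so the following is a deliberately simplified sketch and not the published algorithm: it makes a single bottom-up pass over a pipeline and, for each classifier, moves its threshold up or down in fixed steps while the utility does not get worse, which is loosely what the repair step is described as doing. The recursive calibrate step is omitted. The utility function (F1 on a labelled validation set), the step size, and all function names are assumptions.

```python
from typing import Callable, List, Sequence, Tuple

Sample = Tuple[Sequence[float], bool]  # (one score per pipeline stage, true label)


def f1_utility(thresholds: Sequence[float], samples: List[Sample]) -> float:
    """Assumed utility F: F1 of the pipeline, which accepts a sample only if
    every stage score reaches the corresponding threshold."""
    tp = fp = fn = 0
    for scores, label in samples:
        accepted = all(s >= t for s, t in zip(scores, thresholds))
        if accepted and label:
            tp += 1
        elif accepted:
            fp += 1
        elif label:
            fn += 1
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def repair(thresholds: List[float], k: int, samples: List[Sample],
           step: float = 0.1,
           utility: Callable = f1_utility) -> List[float]:
    """Walk threshold k upward, then downward, in fixed steps while the utility
    does not decrease; return the best threshold vector encountered."""
    best, best_u = list(thresholds), utility(thresholds, samples)
    for direction in (+1, -1):
        current, current_u = list(thresholds), utility(thresholds, samples)
        while True:
            candidate = list(current)
            candidate[k] = round(candidate[k] + direction * step, 6)
            if not 0.0 <= candidate[k] <= 1.0:
                break
            u = utility(candidate, samples)
            if u < current_u:
                break
            current, current_u = candidate, u
            if u > best_u:
                best, best_u = candidate, u
    return best


def select_thresholds(samples: List[Sample], n_stages: int,
                      start: float = 0.0) -> List[float]:
    """Single bottom-up pass: repair the last classifier first, the first last."""
    thresholds = [start] * n_stages
    for k in reversed(range(n_stages)):
        thresholds = repair(thresholds, k, samples)
    return thresholds


# Hypothetical validation data for a two-stage pipeline.
samples = [([0.9, 0.8], True), ([0.8, 0.7], True), ([0.7, 0.2], False),
           ([0.3, 0.9], False), ([0.6, 0.6], True)]
theta = select_thresholds(samples, n_stages=2)
print(theta, f1_utility(theta, samples))
```

A greedy walk of this kind finds a local maximum only, which is consistent with the slides' wording of an "optimal or sub-optimal" combination.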
  • 63. TSA: An Example
  • 64. The Prototype. MultiAgent architecture: X.MAS. Agent framework: JADE. A. Addis, G. Armano, E. Vargiu. From a Generic Multiagent Architecture to Multiagent Information Retrieval Systems. In: AT2AI-6, Sixth International Workshop, From Agent Theory to Agent Implementation, pp. 3–9, 2008. F. Bellifemine, G. Caire, D. Greenwood. Developing Multi-Agent Systems with JADE (Wiley Series in Agent Technology). John Wiley and Sons, 2007.
  • 65. X.MAS at a Glance. Macro-architecture.
  • 66. X.MAS at a Glance. Micro-architecture (diagram): Information Agent (Scheduler, Source); Middle Agent (Scheduler, Dispatcher); Filter Agent (Scheduler, Actuator); Middle Agent (Scheduler, Dispatcher); Task Agent (Scheduler, Actuator); Middle Agent (Scheduler, Dispatcher); Interface Agent (Scheduler).
  • 67. Pub.MAS
  • 68. Pub.MAS. G. Armano, A. Manconi, and E. Vargiu. A MultiAgent System for Retrieving Bioinformatics Publications from Web Sources. IEEE Transactions on Nanobioscience, Special Session on GRID, Web Services, Software Agents and Ontology Applications for Life Science, 6(2), pp. 104-109, 2007.
  • 69. Information Extraction. It is supported by a set of agents explicitly devoted to wrapping the selected information sources and encoding the extracted documents. An information agent wraps the BMC Bioinformatics web site (HTML wrapper). An information agent wraps the PubMed Central digital archive (web service wrapper).
  • 70. Hierarchical Text Categorization. The PF approach previously described has been implemented. Documents have been encoded to: remove all non-informative words; remove the most common morphological and inflexional suffixes; select the relevant features; generate a feature vector for each document. Classification is performed by wkNN classifiers: the score is assigned using non-parametric density estimation of the “a posteriori” probability.
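To make the encoding and scoring steps on slide 70 concrete, here is a minimal sketch of a similarity-weighted kNN score. It is an illustration under assumptions, not the Pub.MAS code: the stop-word list is a hypothetical stub, suffix removal and feature selection are omitted, cosine similarity is assumed as the neighbour weight, and the score is the weighted fraction of positive examples among the k nearest neighbours, used here as a simple estimate of the a posteriori class probability.

```python
import math
from collections import Counter
from typing import List, Tuple

STOP_WORDS = {"the", "of", "a", "and", "in", "for"}  # hypothetical tiny list


def encode(text: str) -> Counter:
    """Drop non-informative words and build a term-frequency feature vector.
    (Stemming and feature selection are left out of this sketch.)"""
    return Counter(w for w in text.lower().split() if w not in STOP_WORDS)


def cosine(u: Counter, v: Counter) -> float:
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def wknn_score(doc: Counter, training: List[Tuple[Counter, bool]], k: int = 3) -> float:
    """Similarity-weighted fraction of positive examples among the k most
    similar training documents; a value in [0, 1]."""
    neighbours = sorted(((cosine(doc, vec), label) for vec, label in training),
                        key=lambda pair: pair[0], reverse=True)[:k]
    total = sum(sim for sim, _ in neighbours)
    if total == 0:
        return 0.0
    return sum(sim for sim, label in neighbours if label) / total


# Hypothetical training set for a "protein structure" category.
training = [
    (encode("protein structure prediction"), True),
    (encode("protein folding structure"), True),
    (encode("database index query"), False),
    (encode("query language for database"), False),
]

print(wknn_score(encode("secondary structure of the protein"), training))
print(wknn_score(encode("index for the database"), training))
```

A score of this kind can be compared against a per-classifier threshold, which is what makes it usable inside a Progressive Filtering pipeline.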
  • 71. The Adopted Taxonomy P G. Baker, C. A. Goble, S. Bechhofer, N. W. Paton, R. Stevens, and A. . Brass. An ontology for bioinformatics applications, Bioinformatics, 15(6), pp. 510–520, 1999. February 16, 2011 – Valencia (Spain)
  • 72. The Adopted Taxonomy February 16, 2011 – Valencia (Spain)
  • 73. The Adopted Taxonomy February 16, 2011 – Valencia (Spain)
  • 74. Users' Feedback ο‚— User feedback is aimed at dealing with any feedback provided by the user ο‚— Two solutions have been experimented ο‚— training an ANN ο‚— using a kNN classifier February 16, 2011 – Valencia (Spain)
  • 75. Experiments
      ◦ Different kinds of tests have been performed, each aimed at highlighting a specific issue:
          ▪ we estimated the (normalized) confusion matrix for each classifier belonging to the highest level of the taxonomy
          ▪ we studied the impact of taking into account pipelines of classifiers, also trying to assess whether a residual independence was in fact present
          ▪ we assessed the solution devised for implementing user's feedback, based on the kNN technique
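
As a reminder of what a normalized confusion matrix reports, here is a small sketch. The slides do not say which normalization was used; this version assumes row normalization, so each row (actual class) sums to 1. The function name and the label values in the usage note are hypothetical.

```python
def normalized_confusion_matrix(actual, predicted, labels):
    """Return matrix[actual_label][predicted_label] as a fraction of
    all items whose actual class is actual_label (row normalization)."""
    counts = {a: {p: 0 for p in labels} for a in labels}
    for a, p in zip(actual, predicted):
        counts[a][p] += 1

    matrix = {}
    for a in labels:
        row_total = sum(counts[a].values())
        matrix[a] = {
            p: (counts[a][p] / row_total if row_total else 0.0) for p in labels
        }
    return matrix
```

For a single category classifier, the labels could be something like "in" and "out": the diagonal entries then give the fraction of documents of each class that were handled correctly, and the off-diagonal entries give the error rates.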
  • 76. Experiments
      ◦ Tests have been performed using selected publications extracted from the BMC Bioinformatics site and from the PubMed Central digital archive
      ◦ Publications have been classified by a domain expert according to the proposed taxonomy
      ◦ For each item of the taxonomy, a set of about 100-150 articles has been selected to train the corresponding wkNN classifier, and 300-400 articles have been used to test it
  • 77. Conclusions
  • 78. Conclusions
      ◦ Bioinformatics needs suitable, automated, and "intelligent" solutions to acquire, analyse, organize, and store biological data
      ◦ IR might be very useful for tackling bioinformatics problems
      ◦ Currently, only a few IR techniques have been adopted for bioinformatics tasks
      ◦ A system aimed at retrieving and filtering bioinformatics publications has been presented as a case study
      ◦ We argue that further investigations and experiments could be made to exploit IR in bioinformatics
  • 79. Acknowledgments
      ◦ This work was partially supported by the Italian Ministry of Education – Investment funds for basic research, under the project ITALBIONET – Italian Network of Bioinformatics
      ◦ I wish to thank all the IASC Group members for their valuable help
      ◦ IASC Group members are:
          ▪ G. Armano – head
          ▪ A. Addis, F. Mascia and E. Vargiu – PhD, Post Doc
          ▪ A. Giuliani, N. Hatami, M. Javarone and F. Ledda – PhD students
          ▪ S. Curatti – collaborator, programmer
      ◦ I also wish to thank Andrea Manconi for his suggestions
  • 80. Thanks for your attention! Contact: Eloisa Vargiu, vargiu@diee.unica.it