Bioinformatics Meets
               Information Retrieval
   State of the Art and a Case Study
                                              Eloisa Vargiu



                           Intelligent Agents and Soft-Computing Group
                          Dept. of Electrical and Electronic Engineering
                                       University of Cagliari, Italy
February 16, 2011 – Valencia (Spain)   email: vargiu@diee.unica.it
My Background

  2000 – 2004                                 2004 – 2009
     Automatic planning                          Bioinformatics
             Classic domains: HW[]                  Protein secondary structure
             Dynamic domains: HIPE                   prediction: MASSP3 and
                                                      GAME/SSP
  2000 - …
                                               2006 - …
     Multiage s te
              nt ys ms
                                                  Information Retrieval
             A Personalized Adaptive and
              Cooperative Multiagent                 Hierarchical text
              System: PACMAS                          categorization: PF and TSA
             A generic architecture to              Recommender systems and
              perform information retrieval           contextual advertising: ConCA
              tasks: X.MAS


Outline

    Context and Mission
    Why Bioinformatics Needs Information Retrieval
    Bioinformatics Meets Information Retrieval
    Case Study: Retrieving and Filtering Bioinformatics Publications
    Conclusions




Context and Mission




Web Evolution

  Web 1.0                             1993

    Source of information
    Personal homepages
  Web 2.0                             2004
    Social networks
    (Micro)Blogging
  Web 3.0                             2005

    Semantic web
    Web composition




Web Evolution and Bioinformatics

  A long time ago...
     Data was stored in local DBs
     Data was shared as flat files
     Biologists worked alone or in small groups




Web Evolution and Bioinformatics

  Today...
     Online repositories
             The major sources of nucleotide sequences are the ones belonging to the
              International Nucleotide Sequence Database Collaboration
                DDBJ (DNA DataBank of Japan)

                EMBL (European Molecular Biology Laboratory)

                GenBank (NIH genetic sequence database)

       Web services
         Basic bioinformatics services are
          classified by the EBI into three categories
            SSS (Sequence Search Services)

            MSA (Multiple Sequence Alignment)

            BSA (Biological Sequence Analysis)



Web Evolution and Scientific
Publications
  A long time ago...
     Publications were consulted at the library
     Just two or three relevant available journals
     Manual selection of relevant publications




Web Evolution and Scientific
Publications
  Today...
     Online journals
     Online conference proceedings
     Publications are often available for free
     Manual selection of relevant publications
      becomes unfeasible




As a Consequence...

  Unstructured information
  Information overload
  Personalized information selection and input imbalance




Our Mission

  To cope with
     Unstructured information, classifying documents according to a
      given taxonomy
     Information overload, filtering information to reduce redundancy
     Personalized information selection and input imbalance, filtering
      information according to user preferences
  Case study
     Retrieving and filtering bioinformatics publications




Research Topics

  Information Retrieval
  Bioinformatics




Information Retrieval

 Information Retrieval (IR) deals with the representation,
 storage, organization of, and access to information items.

 The user must first translate this information need into a query
 which can be processed by an IR system.

 Given the user query, the key goal of an IR system is to retrieve
 information which might be useful or relevant to the user.


                R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval.
                New York: Addison-Wesley, 1999.

Main IR Topics

    Indexing
    Search and Web Search
    Information Filtering
    Text Mining
    Text Categorization and Hierarchical Text Categorization




Bioinformatics

 Bioinformatics is the field of science in which biology,
 computer science, and information technology merge to form a
 single discipline.

 The ultimate goal of the field is to enable the discovery of new
 biological insights as well as to create a global perspective from
 which unifying principles in biology can be discerned.



                National Center for Biotechnology Information (NCBI),
                http://www.ncbi.nlm.nih.gov/.

Main Bioinformatics Research Areas

    Sequence analysis
    Genome annotation
    Computational evolutionary biology
    Analysis of gene expression
    Analysis of protein expression
    Analysis of mutations in cancer
    Comparative genomics
    Modelling biological systems
    Prediction of protein structure
    Molecular interaction

Why Bioinformatics
                       Needs
                Information Retrieval




Does Bioinformatics Need IR?

  Bioinformatics is concerned with researching, developing and
    applying tools and methods to acquire, analyse, organize and
    store biological and medical data

  Indexing and search techniques may help in the task of acquiring data
  Information filtering, text mining and text categorization
   techniques may be useful to the analysis of data
  Text categorization, with particular reference to hierarchical text
   categorization, may be used in the organization and storage tasks



Bioinformatics Data

  A huge amount of data to be
     Indexed
     Searched for in large databases or on the web
     Filtered according to users' preferences
     Text mined
     Categorized according to its textual content




DB Indexing

  Why
    Data types are relegated to blob and unstructured text fields
    Few results in building persistent access paths to support fast
     retrieval methods
    Genomic datasets in public repositories are annotated with free-text
     fields describing the pathological state of the studied sample
    Annotations are not mapped to concepts in any ontology




DB Indexing

  Who
    MoBIoS – Molecular Biological Information System
  What
    A specialized database management system
    The storage manager is based on metric-space indexing
    Query language entails biological data types
  Where
    Sequence homology: local alignment and mutations


                D. Miranker, W. Xu, and R. Mao. MoBIoS: a Metric-Space DBMS to
                Support Biological Discovery. Proceedings of the International
                Conference on Scientific and Statistical Database Management
                Systems, 2003.
DB Indexing

 Who
   --
 What
   Ontology-driven indexing of public datasets for translational
    bioinformatics
   Methods to map text annotations of gene expression datasets to
    concepts in the UMLS
 Where
   Gene Expression Omnibus
   Stanford Tissue Microarray Database

                N.H. Shah, C. Jonquet, A.P. Chiang, A.J. Butte, R. Chen, and M.A.
                Musen. Ontology-driven indexing of public datasets for translational
                bioinformatics. BMC Bioinformatics, 10(Suppl 2):S1, 2009.
Web Indexing

  Why
    Most often sequence retrieval tools and sequence analysis tools are
     separated
    The usage of sequence DBs is often general and limited to
     keyword searching and entry retrieval
    Discovering and accessing the appropriate bioinformatics resource
     for a specific task has become increasingly important




Web Indexing

  Who
    SIRW – A Web Server for Simple Indexing and Retrieval System
  What
    A WWW interface to the Simple Indexing and Retrieval (SIR)
     system to parse and index flat file DBs
    A framework for doing sequence analysis for selected biological
     sequences
  Where
    Sequence analysis: motif pattern searches

                C. Ramu. SIRW: a web server for the Simple Indexing and Retrieval
                System that combines sequence motif searches with keyword searches.
                Nucleic Acids Research, 31(13), pp. 3771-3774, 2003.

Web Indexing

  Who
    BIRI - BIoinformatics Resource Inventory
  What
    An approach for automatically discovering and indexing public
     bioinformatics resources
  Where
    The scientific literature




                G. de la Calle, M. García-Remesal, S. Chiesa, D. de la Iglesia, V.
                Maojo. BIRI: a new approach for automatically discovering and
                indexing available public bioinformatics resources from the literature.
                BMC Bioinformatics, Oct 7;10:320, 2009.
DB Search

  Why
    A wealth of bioinformatics tools and databases has been created
     over the last decade and most are freely available
    Often it is desired to visualize the database hits stacked according
     to the query sequence
    There is no inventory presenting an up-to-date and easily
     searchable index of all these resources




DB Search

  Who
    MView – Multiple alignment Viewer
  What
    A tool for converting the result of a sequence database search into
     the form of a coloured multiple alignment of hits stacked against
     the query
  Where
    Multiple alignment


                N.P. Brown, C. Leroy, and C. Sander. MView: a web-compatible
                database search or multiple alignment viewer. Bioinformatics, 14(4), pp.
                380-381, 1998.

DB Search

  Who
    BioWareDB
  What
    An extensive and current catalog of software and DBs of relevance
     to researchers in the field of biology and medicine
  Where
    Current and available biomedical computing resources




                M.W. Matthiessen. BioWareDB: the biomedical software and database
                search engine. Bioinformatics, 19(17), pp. 2319-2320, 2003.


Web Search

  Why
    Today, scientists can easily post their research findings on the Web
     or compare their discoveries with previous work
    Manually maintaining a wrapper library will not scale to
     accommodate the growth of genomics data sources on the Web




Web Search

  Who
    ---
  What
    An automated system able to find, classify, and wrap new sources
     without constant human intervention
  Where
    Distributed genomics data sources




                D. Rocco and T. Critchlow. Automatic discovery and classification of
                bioinformatics Web sources. Bioinformatics, 19(15), pp. 1927-1933,
                2003.

Web Search

  Who
    GoPubMed
  What
    An ontology-based literature search applied to Gene Ontology
     (GO) and PubMed
  Where
    Scientific literature



                R. Delfs, A. Doms, A. Kozlenkov, and M. Schroeder. GoPubMed:
                ontology-based literature search applied to gene ontology and PubMed.
                In Proceedings of German Bioinformatics Conference, pp. 169–178,
                2004.
Information Filtering

  Why
    In the Web 2.0 scenario, users look for collaborative environments,
     in which they can meet further users with similar preferences and
     needs
    Researchers need to search for and/or generate specialized datasets
     that meet specific requirements




Information Filtering

  Who
    ProDaMa-C Protein Dataset Management – Collaborative
  What
    A web application aimed at
             Generating specialized protein structure datasets
             Favouring the collaboration among researchers
  Where
    Protein structures


                G. Armano and A. Manconi. A Collaborative Web Application for
                Supporting Researchers in the Task of Generating Protein Datasets.
                Advances in Distributed Agent-based Retrieval Tools, V. Pallotta, A.
                Soro, E. Vargiu (eds.), Springer-Verlag, 2011.
Information Filtering

  Who
    Gene Recommender
  What
    An algorithm that ranks genes according to how strongly they
     correlate with a set of query genes
  Where
    Analysis of gene expression




                A.B. Owen, J. Stuart, K. Mach, A.M. Villeneuve, S. Kim. A gene
                recommender algorithm to identify coexpressed genes. Genome
                Research, Aug;13(8), pp. 1828-37, 2003.

Text Mining

  Why
    Web-based tools capable of filtering public DBs are more and more
     required
    Interesting and useful information, relevant to the researcher, could
     appear in documents (e.g., papers) they have not read and therefore
     be missed entirely
    Of paramount importance to DB search methods is a reliable
     means of distinguishing true hits from false hits
    Biologists construct a pathway by reading a large number of
     articles and interpreting them as a consistent network, but the link to
     the original article is lost


Text Mining

  Who
    MedMiner
  What
    An Internet text mining tool that filters the literature and presents
     the most relevant portions in a well-organized way that facilitates
     understanding
  Where
    Gene expression profiling

                L. Tanabe, U. Scherf, L.H. Smith, J.K. Lee, L. Hunter, and J.N.
                Weinstein. MedMiner: an Internet Text-Mining Tool for Biomedical
                Information, with Application to Gene Expression Profiling.
                Biotechniques, Dec;27(6), pp. 1210-4, 1999.
Text Mining

  Who
    BioRAT
  What
    A research assistant that, given a query,
             autonomously finds a set of papers
             reads them
             highlights the most relevant facts in each
  Where
    Scientific literature

                D. P. A. Corney, B. F. Buxton, W. B. Langdon, and D. T. Jones.
                BioRAT: Extracting biological information from full-length papers.
                Bioinformatics, 20(17), pp. 3206–3213, 2004.

Text Mining

  Who
    SAWTED – Structure Assignment With Text Description
  What
    An automated system for filtering DB hits
  Where
    Homologues annotation




                R.M. MacCallum, L.A. Kelley, and M.J. Sternberg. SAWTED: structure
                assignment with text description-enhanced detection of remote
                homologues with automated SWISS-PROT annotation comparisons.
                Bioinformatics, Feb;16(2), pp. 125-9, 2000.
Text Mining

  Who
    PathText
  What
    A system to integrate a pathway visualizer, text mining systems
     and annotation tools into a seamless environment
  Where
    Pathway visualizations



                B. Kemper, T. Matsuzaki, Y. Matsuoka, Y. Tsuruoka, H. Kitano, S.
                Ananiadou, and J. Tsujii. PathText: a text mining integrator for
                biological pathway visualizations. Bioinformatics, 26(12), pp.
                i374-i381, 2010.
Text Categorization

  Why
    Information in text form, such as MEDLINE records, is a greatly
     underutilized source of biological information
    Individual researchers find it difficult to keep up with all the new,
     relevant information
    Systems that extract structured information from natural language
     passages have been highly successful in specialized domains
    Time is ripe for developing such applications for molecular biology
     and genomics




Text Categorization

  Who
    --
  What
    Constructing biological knowledge bases by extracting information
     from text sources
  Where
    MEDLINE



                M. Craven and J. Kumlien. Constructing Biological Knowledge Bases
                by Extracting Information from Text Sources. In Proceedings of the 7th
                International Conference on Intelligent Systems for Molecular Biology,
                1999.
Text Categorization

  Who
    Genies
  What
    A natural-language processing system for the extraction of
     molecular pathways
  Where
    Scientific publications



                C. Friedman, P. Kra, H. Yu, M. Krauthammer, and A. Rzhetsky. Genies:
                a natural-language processing system for the extraction of molecular
                pathways from journal articles. Bioinformatics, 17, pp. 574–582, 2001.

Hierarchical Text Categorization

  Why
    A great deal of genomics information accumulated through years is
     available in online text repositories (such as MEDLINE)
    These resources still do not provide adequate mechanisms for
     retrieving the required information
    Traditional filtering techniques based on keyword search are often
     inadequate to express what the user is really searching for
    Web repositories, such as Medical Subject Headings (MeSH) in
     MEDLINE, encompass an underlying taxonomy




Hierarchical Text Categorization

  Who
    --
  What
    A tool for assisting biologists with literature search for the task of
     associating genes with Gene Ontology codes
  Where
    MEDLINE



                S. Kiritchenko, S. Matwin, and A. F. Famili. Hierarchical text
                categorization as a tool of associating genes with gene ontology codes.
                In 2nd European Workshop on Data Mining and Text Mining for
                Bioinformatics, pp. 26–30, 2004.
Hierarchical Text Categorization

  Who
    Pub.MAS
  What
    A multiagent system for retrieving and classifying publications
  Where
    BMC Bioinformatics
    PubMed Central


                G. Armano, A. Manconi, and E. Vargiu. A MultiAgent System for
                Retrieving Bioinformatics Publications from Web Sources. IEEE
                Transactions on Nanobioscience, Special Session on GRID, Web
                Services, Software Agents and Ontology Applications for Life Science,
                6(2), pp. 104-109, 2007.
Case Study:
    Retrieving and Filtering
   Bioinformatics Publications




An IR Task

  [Figure: diagram of the IR task; labels recovered from the slide]
     Online Repositories
     Information Extraction: Wrapping Information Sources
     Extracted Data/Information
     Text Categorization: Taxonomic Classification of Items
     Selected Data/Information
     User's Feedback: Adaptive Behavior

Information Extraction

  Essential to retrieve documents provided by heterogeneous and
    distributed sources




                A.H.F. Laender, B.A. Ribeiro-Neto, A.S. da Silva, J.S. Teixeira.
                A brief survey of web data extraction tools. SIGMOD Rec. 31(2), pp.
                84–93, 2002.
Text Categorization

  It is the task of determining and assigning topical labels to
   content
  Typical approaches to text categorization
       Statistical
       Semantic
  In recent years several researchers have investigated the use of
    hierarchies for text categorization


                F. Sebastiani. A tutorial on automated text categorisation. Proceedings
                of ASAI-99, 1st Argentinian Symposium on Artificial Intelligence, pp. 7-
                35, 1999.

Users' Feedback

  It is aimed at dealing with any feedback provided by the user
  In semiautomated classification and adaptive filtering we may
   expect the user of a classifier to provide feedback on how test
   documents have been classified
  In this case further training may be performed during the
   operating phase




Hierarchical Text Categorization

 Hierarchical Text Categorization (HTC) deals with problems
 where categories are organized in the form of a hierarchy.




                D. Koller, M. Sahami. Hierarchically classifying documents using very
                few words. Proceedings of 14th International Conference on Machine
                Learning, pp. 170–178, 1997.

HTC at a Glance

  HTC studies how to improve the performance provided by
    classical text categorization techniques by exploiting the
    knowledge of the taxonomic relationships among classes




Motivations

  People organize large collections of documents in hierarchies of
   topics, or arrange a large body of knowledge in ontologies
  The main goal of automatic text categorization is to deal with
   underlying taxonomies
  A hierarchical approach can
   give benefits in real-world
   scenarios, characterized by
   information overload and
   imbalanced data




HTC Approaches

  Pachinko machine
     At each level of the hierarchy
             The classifier selects the single most probable category
             It goes down the hierarchy inspecting only the children of the selected
              node
  Probabilistic hierarchical local approach
     At each level of the hierarchy
             The classifier makes probabilistic decisions
             It selects the leaf categories on the most probable paths



                S. Kiritchenko. Hierarchical text categorization and its application to
                bioinformatics. Ph.D. Thesis, University of Ottawa, Canada, 2006.
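
The two strategies on this slide can be contrasted on a toy example. The sketch below is illustrative only: the taxonomy, the category names, and the local scores are made up, and the scores stand in for the outputs of real per-node classifiers. It is not taken from the cited thesis.

```python
# Hypothetical toy taxonomy: each node maps to its children (leaves map to []).
TAXONOMY = {
    "root": ["biology", "computing"],
    "biology": ["genomics", "proteomics"],
    "computing": ["databases", "algorithms"],
    "genomics": [], "proteomics": [], "databases": [], "algorithms": [],
}

# Hypothetical local scores P(child | parent, document) for a single document.
SCORES = {
    "biology": 0.55, "computing": 0.45,
    "genomics": 0.6, "proteomics": 0.4,
    "databases": 0.9, "algorithms": 0.1,
}

def pachinko(node="root"):
    """Greedy descent: at each level keep only the most probable child."""
    while TAXONOMY[node]:
        node = max(TAXONOMY[node], key=SCORES.get)
    return node

def probabilistic(node="root", prob=1.0):
    """Score every root-to-leaf path by the product of its local probabilities."""
    if not TAXONOMY[node]:
        return {node: prob}
    paths = {}
    for child in TAXONOMY[node]:
        paths.update(probabilistic(child, prob * SCORES[child]))
    return paths

greedy_leaf = pachinko()
path_probs = probabilistic()
best_leaf = max(path_probs, key=path_probs.get)
print(greedy_leaf, best_leaf)
```

With these made-up scores the greedy descent commits to the "biology" branch at the first level, while the path-product view prefers a leaf under "computing". This illustrates how an early hard decision can hide a stronger path further down.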
HTC Approaches

 Local classifier per node
    Each classifier decides whether to forward the document to its children
 Local classifier per parent node
    Each classifier decides to which subtree(s) the document should be
     sent
 Local classifier per level
    The number of outputs per level grows while going down through
     the taxonomy
 Global classifier
    One classifier is trained, able to discriminate among all categories

                C.J. Silla and A. Freitas. A survey on hierarchical classification across
                different application domains. Journal of Data Mining and Knowledge
                Discovery, 2(1-2), pp. 31-72, 2010.
Progressive Filtering

 Progressive Filtering (PF) is a simple categorization technique
  that operates on hierarchically structured categories
 A way to implement PF consists of decomposing a given rooted
  taxonomy into pipelines, one for each path that exists between
  the root and each node of the taxonomy
 Each node is a binary classifier able to recognize whether or not
  an input belongs to the corresponding class
 A threshold selection algorithm (TSA) can be run to identify an
  optimal, or sub-optimal, combination of thresholds for each
  pipeline
                A. Addis, G. Armano, E. Vargiu. Assessing Progressive Filtering to
                Perform Hierarchical Text Categorization in Presence of Input
                Imbalance. Proceedings of International Conference on Knowledge
                Discovery and Information Retrieval, pp. 14-23, 2010.

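
The decomposition into pipelines can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the taxonomy and category names are invented, the root is treated as accepting every input (so it is left out of the pipelines), and each classifier is reduced to a score compared against a threshold.

```python
# Hypothetical toy taxonomy, given as child -> parent (the root has parent None).
PARENT = {
    "root": None,
    "biology": "root",
    "genomics": "biology",
    "proteomics": "biology",
    "computing": "root",
}

def pipeline_for(node):
    """Return the chain of classifiers from just below the root down to `node`."""
    path = []
    while PARENT[node] is not None:
        path.append(node)
        node = PARENT[node]
    return list(reversed(path))

def pipelines():
    """One pipeline per non-root node of the taxonomy."""
    return {n: pipeline_for(n) for n in PARENT if PARENT[n] is not None}

def accepts(pipeline, scores, thresholds):
    """An input passes a pipeline only if every classifier along it fires."""
    return all(scores[c] >= thresholds[c] for c in pipeline)

# Made-up classifier scores for one document, with a uniform threshold of 0.5.
scores = {"biology": 0.8, "genomics": 0.7, "proteomics": 0.2, "computing": 0.1}
thresholds = dict.fromkeys(scores, 0.5)
accepted = [n for n, p in pipelines().items() if accepts(p, scores, thresholds)]
print(accepted)
```

Because each pipeline is just a list of classifiers, its thresholds can be tuned without touching the other pipelines.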
PF at a Glance




  Starting from the root, each input traverses the taxonomy as a
     “token”
Classifiers in PF




  Partitioning the taxonomy in pipelines gives rise to a set of new
    classifiers, each represented by a pipeline


  The same classifier may have different behaviours, depending on
   the pipeline in which it is embedded
  Each pipeline can be considered in isolation from the others
Threshold Selection in PF

  A relevant problem is how to calibrate the threshold of the
   binary classifiers embedded by each pipeline in order to
   optimize the pipeline behaviour
  Searching for an optimal or sub-optimal combination of
   thresholds in a pipeline can be actually viewed as the problem of
   finding a maximum in a utility function F that depends on the
   corresponding threshold vector θ
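
Written out, the search problem for a pipeline of n classifiers reads as follows. The slide only names F and θ; the index n and the range [0,1] for each threshold are assumptions made here for concreteness.

```latex
\theta^{*} = \arg\max_{\theta \in [0,1]^{n}} F(\theta),
\qquad \theta = (\theta_{1}, \ldots, \theta_{n})
```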




TSA

  For each pipeline the best combination of thresholds is
    calculated according to a bottom up algorithm that uses two
    functions
       Repair, which increases/decreases (↑ / ↓) the threshold until the
        utility function reaches a maximum
       Calibrate, which recursively operates downward from the given
        classifier by repeatedly calling repair (↑ / ↓)


                A. Addis, G. Armano, E. Vargiu. A comparative experimental
                assessment of a threshold selection algorithm in hierarchical text
                categorization. In: Advances in Information Retrieval. The 33rd
                European Conference on Information Retrieval (ECIR 2011), 2011
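
A rough sketch of how the two functions could interact is given below. It is a simplified reading of the slide, not the published TSA: the fixed step size, the starting value 0.5, the clamping to [0,1], and the toy utility function are all assumptions introduced for illustration.

```python
def repair(utility, thresholds, k, direction, step=0.05):
    """Move threshold k up (+1) or down (-1) while the utility keeps improving."""
    best = utility(thresholds)
    while True:
        candidate = list(thresholds)
        candidate[k] = min(1.0, max(0.0, candidate[k] + direction * step))
        score = utility(candidate)
        if score <= best:          # no further improvement: stop here
            return thresholds
        thresholds, best = candidate, score

def calibrate(utility, thresholds, k, step=0.05):
    """Repair classifier k in both directions, then recurse on the ones below it."""
    thresholds = repair(utility, thresholds, k, +1, step)
    thresholds = repair(utility, thresholds, k, -1, step)
    if k + 1 < len(thresholds):
        thresholds = calibrate(utility, thresholds, k + 1, step)
    return thresholds

def select_thresholds(utility, n, start=0.5, step=0.05):
    """Bottom-up pass: begin at the deepest classifier and move toward the root."""
    thresholds = [start] * n
    for k in reversed(range(n)):
        thresholds = calibrate(utility, thresholds, k, step)
    return thresholds

# Hypothetical utility with a known maximum at (0.3, 0.7), used only to exercise the code.
def toy_utility(t):
    return -((t[0] - 0.3) ** 2 + (t[1] - 0.7) ** 2)

best = select_thresholds(toy_utility, n=2)
print(best)
```

In practice the utility would be measured on validation data (for example an F-measure of the pipeline), which makes each evaluation far more expensive than in this toy setting.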


TSA: An Example




The Prototype

  MultiAgent Architecture
     X.MAS
  Agent Framework
     JADE



                A. Addis, G. Armano, E. Vargiu. From a Generic Multiagent
                Architecture to Multiagent Information Retrieval Systems. In: AT2AI-6,
                Sixth International Workshop, From Agent Theory to Agent
                Implementation, pp. 3–9, 2008.

                F. Bellifemine, G. Caire, D. Greenwood. Developing Multi-Agent
                Systems with JADE (Wiley Series in Agent Technology). John Wiley
                and Sons, 2007.
X.MAS at a Glance

  Macro-architecture




  Micro-architecture
     Information Agent: Scheduler, Source
     Middle Agent: Scheduler, Dispatcher
     Filter Agent: Scheduler, Actuator
     Middle Agent: Scheduler, Dispatcher
     Task Agent: Scheduler, Actuator
     Middle Agent: Scheduler, Dispatcher
     Interface Agent: Scheduler

Pub.MAS




                G. Armano, A. Manconi, and E. Vargiu. A MultiAgent System for
                Retrieving Bioinformatics Publications from Web Sources. IEEE
                Transactions on Nanobioscience, Special Session on GRID, Web
                Services, Software Agents and Ontology Applications for Life Science,
                6(2), pp. 104-109, 2007.
Information Extraction

  It is supported by a set of agents explicitly devoted to
     wrap the selected information sources
     encode the extracted documents
  An information agent wraps BMC Bioinformatics web site
     HTML wrapper
  An information agent wraps PubMed Central digital archive
     Web service wrapper
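
To give a feel for what an HTML wrapper does, here is a minimal sketch using only Python's standard `html.parser` module. The page fragment, the `article-title` class name, and the `TitleLinkWrapper` class are hypothetical: the slides do not show the actual markup of the wrapped site, and the prototype itself is built on JADE rather than Python.

```python
from html.parser import HTMLParser

class TitleLinkWrapper(HTMLParser):
    """Collect (title, url) pairs from <a class="article-title"> elements."""

    def __init__(self):
        super().__init__()
        self.records = []
        self._href = None      # set while inside a matching <a> element
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("class") == "article-title":
            self._href = attrs.get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.records.append(("".join(self._text).strip(), self._href))
            self._href = None

# Made-up page fragment standing in for a journal's table of contents.
PAGE = """
<ul>
  <li><a class="article-title" href="/articles/1">Paper one</a></li>
  <li><a href="/about">About</a></li>
  <li><a class="article-title" href="/articles/2">Paper <i>two</i></a></li>
</ul>
"""

wrapper = TitleLinkWrapper()
wrapper.feed(PAGE)
records = wrapper.records
print(records)
```

A wrapper of this kind is tied to the page layout, which is one reason a web service wrapper is preferable when the source offers one.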




Hierarchical Text Categorization

  The PF approach previously described has been implemented
  Documents have been encoded to
     remove all non-informative words
     remove the most common morphological and inflexional suffixes
     select the relevant features
     generate a feature vector for each document
  Classification is performed by wkNN classifiers
     the score is assigned using non-parametric density estimation of the
      “a posteriori” probability
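
The encoding steps and the weighted kNN score can be sketched as below. Everything here is a simplification under stated assumptions: the stopword list and suffix list are tiny stand-ins for real resources, feature selection is omitted, cosine similarity is an assumed choice of metric, and the training sentences are invented.

```python
import math
import re
from collections import Counter

STOPWORDS = {"the", "of", "a", "and", "in", "for", "to", "is"}  # tiny illustrative list
SUFFIXES = ("ing", "es", "s")                                   # crude stand-in for a stemmer

def encode(text):
    """Tokenize, drop stopwords, strip common suffixes, and count terms."""
    terms = []
    for word in re.findall(r"[a-z]+", text.lower()):
        if word in STOPWORDS:
            continue
        for suffix in SUFFIXES:
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                word = word[: -len(suffix)]
                break
        terms.append(word)
    return Counter(terms)

def cosine(u, v):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

def wknn_score(doc, training, k=3):
    """Similarity-weighted share of positive examples among the k nearest neighbours."""
    neighbours = sorted(((cosine(doc, vec), label) for vec, label in training), reverse=True)[:k]
    total = sum(sim for sim, _ in neighbours)
    positive = sum(sim for sim, label in neighbours if label)
    return positive / total if total else 0.0

# Made-up training set for one category (True = belongs to the category).
training = [
    (encode("aligning protein sequences"), True),
    (encode("protein structure prediction"), True),
    (encode("indexing relational databases"), False),
    (encode("query processing in databases"), False),
]
score = wknn_score(encode("predicting protein structures"), training, k=3)
print(score)
```

The returned score lies in [0, 1] and can be compared against a per-classifier threshold, which is the quantity that threshold selection tunes.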




The Adopted Taxonomy




                P. G. Baker, C. A. Goble, S. Bechhofer, N. W. Paton, R. Stevens, and A.
                Brass. An ontology for bioinformatics applications, Bioinformatics,
                15(6), pp. 510–520, 1999.




Users' Feedback

  User feedback is aimed at dealing with any feedback provided
   by the user
  Two solutions have been tested
       training an ANN
       using a kNN classifier




Experiments

  Different kinds of tests have been performed, each aimed at
    highlighting a specific issue
       we estimated the (normalized) confusion matrix for each classifier
        belonging to the highest level of the taxonomy
       we studied the impact of taking into account pipelines of
        classifiers, also trying to assess whether a residual independence
        was in fact present
       we assessed the solution devised for implementing user feedback,
        based on the kNN technique
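
  As an illustration of the first test, a normalized confusion matrix can
  be computed as follows. This is a minimal sketch: row normalization
  (dividing by the number of items of each true class) is an assumption,
  since the slide only says "(normalized)", and the outcomes below are
  made up rather than taken from the experiments.

```python
def normalized_confusion(actual, predicted, labels):
    """Row-normalized confusion matrix: entry [i][j] is the fraction of items
    whose true label is labels[i] that were predicted as labels[j]."""
    index = {label: i for i, label in enumerate(labels)}
    counts = [[0] * len(labels) for _ in labels]
    for a, p in zip(actual, predicted):
        counts[index[a]][index[p]] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        # A class with no items yields a row of zeros instead of dividing by zero.
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix


# Made-up outcomes for one binary classifier (relevant / not relevant).
actual    = [True, True, True, True, False, False, False, False]
predicted = [True, True, True, False, False, False, False, True]
matrix = normalized_confusion(actual, predicted, labels=[True, False])
```

  With row normalization the diagonal holds the per-class recall, which
  makes classifiers trained on differently sized categories comparable.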




Experiments

  Tests have been performed using selected publications extracted
   from the BMC Bioinformatics site and from the PubMed Central
   digital archive
  Publications have been classified by a domain expert
   according to the proposed taxonomy
  For each item of the taxonomy, a set of about 100-150 articles
   has been selected to train the corresponding wkNN classifier,
   and 300-400 articles have been used to test it




Conclusions

  Bioinformatics needs suitable, automated, and "intelligent"
   solutions to acquire, analyse, organize, and store biological data
  IR might be very useful in tackling bioinformatics problems
  Currently, only a few IR techniques have been adopted to solve
   bioinformatics tasks
  A system aimed at retrieving and filtering bioinformatics
   publications has been presented as a case study
  We argue that further investigations and experiments are needed
   to exploit IR in bioinformatics



Acknowledgments

  This work was partially supported by the Italian Ministry of
   Education – Investment funds for basic research, under the
   project ITALBIONET – Italian Network of Bioinformatics
  I wish to thank all the IASC Group members for their valuable
   help
  IASC Group members are:
       G. Armano – head
       A. Addis, F. Mascia and E. Vargiu – PhD, Post Doc
       A. Giuliani, N. Hatami, M. Javarone and F. Ledda – PhD students
       S. Curatti – collaborator, programmer
  I also wish to thank Andrea Manconi for his suggestions

Thanks for your attention!
Contact: Eloisa Vargiu vargiu@diee.unica.it

 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL Router
 

Bioinformatics Meets Information Retrieval

  • 1. Bioinformatics Meets Information Retrieval State of the Art and a Case Study Eloisa Vargiu Intelligent Agents and Soft-Computing Group Dept. of Electrical and Electronic Engineering University of Cagliari, Italy February 16, 2011 – Valencia (Spain) email: vargiu@diee.unica.it
  • 2. My Background  2000 – 2004: Automatic planning  Classic domains: HW[]  Dynamic domains: HIPE  2000 - …: Multiagent systems  A Personalized Adaptive and Cooperative Multiagent System: PACMAS  A generic architecture to perform information retrieval tasks: X.MAS  2004 – 2009: Bioinformatics  Protein secondary structure prediction: MASSP3 and GAME/SSP  2006 - …: Information Retrieval  Hierarchical text categorization: PF and TSA  Recommender systems and contextual advertising: ConCA
  • 3. Outline  Context and Mission  Why Bioinformatics Needs Information Retrieval  Bioinformatics Meets Information Retrieval  Case Study: Retrieving and Filtering Bioinformatics Publications  Conclusions February 16, 2011 – Valencia (Spain)
  • 4. Context and Mission February 16, 2011 – Valencia (Spain)
  • 5. Web Evolution  Web 1.0 1993  Source of information  Personal homepages  Web 2.0 2004  Social networks  (Micro)Blogging  Web 3.0 2005  Semantic web  Web composition February 16, 2011 – Valencia (Spain)
  • 6. Web Evolution and Bioinformatics  A long time ago...  Data was stored in local DBs  Data was shared as flat files  Biologists worked alone or in small groups February 16, 2011 – Valencia (Spain)
  • 7. Web Evolution and Bioinformatics  Today...  Online repositories  The major sources of nucleotide sequence are the ones belonging to the International Nucleotide Sequence Database Collaboration  DDBJ (DNA DataBank of Japan)  EMBL (European Molecular Biology Laboratory)  GenBank (NIH genetic sequence database)  Web services  Basic bioinformatics services are classified by the EBI into three categories  SSS (Sequence Search Services)  MSA (Multiple Sequence Alignment)  BSA (Biological Sequence Analysis) February 16, 2011 – Valencia (Spain)
  • 8. Web Evolution and Scientific Publications  A long time ago...  Publications were consulted at the library  Just two or three relevant available journals  Manual selection of relevant publications February 16, 2011 – Valencia (Spain)
  • 9. Web Evolution and Scientific Publications  Today...  Online journals  Online conference proceedings  Publications are often available for free  Manual selection of relevant publications becomes unfeasible February 16, 2011 – Valencia (Spain)
  • 10. As a Consequence...  Unstructured information  Information overload  Personalized information selection and input imbalance February 16, 2011 – Valencia (Spain)
  • 11. Our Mission  To cope with  Unstructured information, classifying documents according to a given taxonomy  Information overload, filtering information to reduce redundancy  Personalized information selection and input imbalance, filtering information according to user preferences  Case study  Retrieving and filtering bioinformatics publications February 16, 2011 – Valencia (Spain)
  • 12. Research Topics  Information Retrieval  Bioinformatics February 16, 2011 – Valencia (Spain)
  • 13. Information Retrieval Information Retrieval (IR) deals with the representation, storage, organization of, and access to information items. The user must first translate this information need into a query which can be processed by an IR system. Given the user query, the key goal of an IR system is to retrieve information which might be useful or relevant to the user. R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. New York: Addison-Wesley, 1999.
  • 14. Main IR Topics  Indexing  Search and Web Search  Information Filtering  Text Mining  Text Categorization and Hierarchical Text Categorization February 16, 2011 – Valencia (Spain)
  • 15. Bioinformatics Bioinformatics is the field of science in which biology, computer science, and information technology merge to form a single discipline. The ultimate goal of the field is to enable the discovery of new biological insights as well as to create a global perspective from which unifying principles in biology can be discerned. National Center for Biotechnology Information (NCBI), http://www.ncbi.nlm.nih.gov/.
  • 16. Main Bioinformatics Research Areas  Sequence analysis  Genome annotation  Computational evolutionary biology  Analysis of gene expression  Analysis of protein expression  Analysis of mutations in cancer  Comparative genomics  Modelling biological systems  Prediction of protein structure  Molecular interaction February 16, 2011 – Valencia (Spain)
  • 17. Why Bioinformatics Needs Information Retrieval February 16, 2011 – Valencia (Spain)
  • 18. Does Bioinformatics Need IR?  Bioinformatics is concerned with researching, developing and applying tools and methods to acquire, analyse, organize and store biological and medical data  Indexing and search techniques may help in the task of acquiring  Information filtering, text mining and text categorization techniques may be useful to the analysis of data  Text categorization, with particular reference to hierarchical text categorization, may be used in the organization and storage tasks February 16, 2011 – Valencia (Spain)
  • 19. Bioinformatics Data  A huge amount of data to be  Indexed  Searched for in large databases or on the web  Filtered according to users' preferences  Text mined  Categorized according to its textual content
  • 20. DB Indexing  Why  Data types are relegated to blob and unstructured text fields  Few results in building persistent access paths to support fast retrieval methods  Genomic datasets in public repositories are annotated with free-text fields describing the pathological state of the studied sample  Annotations are not mapped to concepts in any ontology
  • 21. DB Indexing  Who  MoBIoS – Molecular Biological Information System  What  A specialized database management system  The storage manager is based on metric-space indexing  Query language entails biological data types  Where  Sequence homology: local alignment and mutations D. Miranker, W. Xu, and R. Mao. MoBIoS: a Metric-Space DBMS to Support Biological Discovery. Proceedings of the International Conference on Scientific and Statistical Database Management Systems, 2003. February 16, 2011 – Valencia (Spain)
  • 22. DB Indexing  Who  --  What  Ontology-driven indexing of public datasets for translational bioinformatics  Methods to map text annotations of gene expression datasets to concepts in the UMLS  Where  Gene Expression Omnibus  Stanford Tissue Microarray Database N.H. Shah, C. Jonquet, A.P. Chiang, A.J. Butte, R. Chen, and M.A. Musen. Ontology-driven indexing of public datasets for translational bioinformatics. BMC Bioinformatics, 10(Suppl 2):S1, 2009.
  • 23. Web Indexing  Why  Most often sequence retrieval tools and sequence analysis tools are separated  The usage of sequence DBs is often general and limited to keyword searching and entry retrieval  Discovering and accessing the appropriate bioinformatics resource for a specific task has become increasingly important February 16, 2011 – Valencia (Spain)
  • 24. Web Indexing  Who  SIRW – A Web Server for Simple Indexing and Retrieval System  What  A WWW interface to the Simple Indexing and Retrieval (SIR) system to parse and index flat file DBs  A framework for doing sequence analysis for selected biological sequences  Where  Sequence analysis: motif pattern searches C. Ramu. SIRW: a web server for the Simple Indexing and Retrieval System that combines sequence motif searches with keyword searches. Nucleic Acids Research, 31(13). pp. 3771-3774, 2003. February 16, 2011 – Valencia (Spain)
  • 25. Web Indexing  Who  BIRI - BIoinformatics Resource Inventory  What  An approach for automatically discovering and indexing public bioinformatics resources  Where  The scientific literature G. de la Calle, M. García-Remesal, S. Chiesa, D. de la Iglesia, V. Maojo. BIRI: a new approach for automatically discovering and indexing available public bioinformatics resources from the literature. BMC Bioinformatics, Oct 7;10:320, 2009. February 16, 2011 – Valencia (Spain)
  • 26. DB Search  Why  A wealth of bioinformatics tools and databases has been created over the last decade and most are freely available  Often it is desired to visualize the database hits stacked according to the query sequence  There is no inventory presenting an up-to-date and easily searchable index of all these resources February 16, 2011 – Valencia (Spain)
  • 27. DB Search  Who  MView – Multiple alignment Viewer  What  A tool for converting the result of a sequence database search into the form of a coloured multiple alignment of hits stacked against the query  Where  Multiple alignment N.P. Brown, C. Leroy, and C. Sander. MView: a web-compatible database search or multiple alignment viewer. Bioinformatics, 14(4), pp. 380-381, 1998.
  • 28. DB Search  Who  BioWareDB  What  An extensive and current catalog of software and DBs of relevance to researchers in the field of biology and medicine  Where  Current and available biomedical computing resources M.W. Matthiessen. BioWareDB: the biomedical software and database search engine. Bioinformatics, 19(17), pp. 2319-2320, 2003. February 16, 2011 – Valencia (Spain)
  • 29. Web Search  Why  Today, scientists can easily post their research findings on the Web or compare their discoveries with previous work  Manually maintaining a wrapper library will not scale to accommodate the growth of genomics data sources on the Web February 16, 2011 – Valencia (Spain)
  • 30. Web Search  Who  ---  What  An automated system able to find, classify, and wrap new sources without constant human intervention  Where  Distributed genomics data sources D. Rocco and T. Critchlow. Automatic discovery and classification of bioinformatics Web sources. Bioinformatics, 19(15), pp. 1927-1933, 2003. February 16, 2011 – Valencia (Spain)
  • 31. Web Search  Who  GoPubMed  What  An ontology-based literature search applied to Gene Ontology (GO) and PubMed  Where  Scientific literature R. Delfs, A. Doms, A. Kozlenkov, and M. Schroeder. GoPubMed: ontology-based literature search applied to gene ontology and PubMed. In Proceedings of German Bioinformatics Conference, pp. 169–178, 2004. February 16, 2011 – Valencia (Spain)
  • 32. Information Filtering  Why  In the Web 2.0 scenario, users look for collaborative environments, in which they can meet further users with similar preferences and needs  Researchers need to search for and/or generate specialized datasets that meet specific requirements February 16, 2011 – Valencia (Spain)
  • 33. Information Filtering  Who  ProDaMa-C Protein Dataset Management – Collaborative  What  A web application aimed at  Generating specialized protein structure datasets  Favouring the collaboration among researchers  Where  Protein structures G. Armano and A. Manconi. A Collaborative Web Application for Supporting Researchers in the Task of Generating Protein Datasets. Advances in Distributed Agent-based Retrieval Tools, V. Pallotta, A. Soro, E. Vargiu (eds.), Springer-Verlag, 2011. February 16, 2011 – Valencia (Spain)
  • 34. Information Filtering  Who  Gene Recommender  What  An algorithm that ranks genes according to how strongly they correlate with a set of query genes  Where  Analysis of gene expression A.B. Owen, J. Stuart, K. Mach, A.M. Villeneuve, S. Kim. A gene recommender algorithm to identify coexpressed genes. Genome Research, Aug;13(8), pp. 1828-37, 2003. February 16, 2011 – Valencia (Spain)
  • 35. Text Mining  Why  Web-based tools capable of filtering public DBs are increasingly required  Interesting and useful information, relevant to the researcher, could appear in documents (e.g., papers) they have not read and therefore be missed entirely  Of paramount importance to DB search methods is a reliable means of distinguishing true hits from false hits  Biologists construct a pathway by reading a large number of articles and interpreting them as a consistent network, but the link to the original article is missed
  • 36. Text Mining  Who  MedMiner  What  An Internet text mining tool that filters the literature and presents the most relevant portions in a well-organized way that facilitates understanding  Where  Gene expression profiling L. Tanabe, U. Scherf, L.H. Smith, J.K. Lee, L. Hunter, and J.N. Weinstein. MedMiner: an Internet Text-Mining Tool for Biomedical Information, with Application to Gene Expression Profiling. Biotechniques, Dec;27(6), pp. 1210-4, 1999.
  • 37. Text Mining  Who  BioRAT  What  A research assistant that, given a query,  autonomously finds a set of papers  reads them  highlights the most relevant facts in each  Where  Scientific literature D. P. A. Corney, B. F. Buxton, W. B. Langdon, and D. T. Jones. BioRAT: Extracting biological information from full-length papers. Bioinformatics, 20(17), pp. 3206–3213, 2004.
  • 38. Text Mining  Who  SAWTED – Structure Assignment With Text Description  What  An automated system for filtering DB hits  Where  Homologues annotation R.M. MacCallum, L.A. Kelley, and M.J. Sternberg. SAWTED: structure assignment with text description-enhanced detection of remote homologues with automated SWISS-PROT annotation comparisons. Bioinformatics, Feb;16(2), pp. 125-9, 2000.
  • 39. Text Mining  Who  PathText  What  A system to integrate a pathway visualizer, text mining systems and annotation tools into a seamless environment  Where  Pathway visualizations B. Kemper, T. Matsuzaki, Y. Matsuoka, Y. Tsuruoka, H. Kitano, S. Ananiadou, and J. Tsujii. PathText: a text mining integrator for biological pathway visualizations. Bioinformatics, 26(12), pp. i374-i381, 2010.
  • 40. Text Categorization  Why  Information in text form, such as MEDLINE records, is a greatly underutilized source of biological information  Individual researchers find it difficult to keep up with all the new, relevant information  Systems that extract structured information from natural language passages have been highly successful in specialized domains  Time is ripe for developing such applications for molecular biology and genomics February 16, 2011 – Valencia (Spain)
  • 41. Text Categorization  Who  --  What  Constructing biological knowledge bases by extracting information from text sources  Where  MEDLINE M. Craven and J. Kumlien. Constructing Biological Knowledge Bases by Extracting Information from Text Sources. In Proceedings of the 7th International Conference on Intelligent Systems for Molecular Biology, 1999. February 16, 2011 – Valencia (Spain)
  • 42. Text Categorization  Who  Genies  What  A natural-language processing system for the extraction of molecular pathways  Where  Scientific publications C. Friedman, P. Kra, H. Yu, M. Krauthammer, and A. Rzhetsky. Genies: a natural-language processing system for the extraction of molecular pathways from journal articles. Bioinformatics, 17, pp. 574–582, 2001.
  • 43. Hierarchical Text Categorization  Why  A great deal of genomics information accumulated over the years is available in online text repositories (such as MEDLINE)  These resources still do not provide adequate mechanisms for retrieving the required information  Traditional filtering techniques based on keyword search are often inadequate to express what the user is really searching for  Web repositories, such as Medical Subject Headings (MeSH) in MEDLINE, encompass an underlying taxonomy
  • 44. Hierarchical Text Categorization  Who  --  What  A tool for assisting biologists with literature search for the task of associating genes with Gene Ontology codes  Where  MEDLINE S. Kiritchenko, S. Matwin, and A. F. Famili. Hierarchical text categorization as a tool of associating genes with gene ontology codes. In 2nd European Workshop on Data Mining and Text Mining for Bioinformatics, pp. 26–30, 2004. February 16, 2011 – Valencia (Spain)
  • 45. Hierarchical Text Categorization  Who  Pub.MAS  What  A multiagent system for retrieving and classifying publications  Where  BMC Bioinformatics  PubMed Central G. Armano, A. Manconi, and E. Vargiu. A MultiAgent System for Retrieving Bioinformatics Publications from Web Sources. IEEE Transactions on Nanobioscience, Special Session on GRID, Web Services, Software Agents and Ontology Applications for Life Science, 6(2), pp. 104-109, 2007. February 16, 2011 – Valencia (Spain)
  • 46. Case Study: Retrieving and Filtering Bioinformatics Publications
  • 47. An IR Task [Figure labels: Online Repositories; Information Extraction (Wrapping Information Sources); Extracted Data/Information; Text Categorization (Taxonomic Classification of Items); Selected Data/Information; User's Feedback (Adaptive Behavior)]
  • 48. Information Extraction  Essential to retrieve documents provided by heterogeneous and distributed sources A.H.F. Laender, B.A. Ribeiro-Neto, A.S. da Silva, J.S. Teixeira (2002) : A brief survey of web data extraction tools. SIGMOD Rec. 31(2), pp. 84–93. February 16, 2011 – Valencia (Spain)
  • 49. Text Categorization  It is the task of determining and assigning topical labels to content  Typical approaches to text categorization  Statistical  Semantic  In recent years several researchers have investigated the use of hierarchies for text categorization F. Sebastiani. A tutorial on automated text categorisation. Proceedings of ASAI-99, 1st Argentinian Symposium on Artificial Intelligence, pp. 7-35, 1999.
  • 50. Users' Feedback  It is aimed at dealing with any feedback provided by the user  In semiautomated classification and adaptive filtering we may expect the user of a classifier to provide feedback on how test documents have been classified  In this case further training may be performed during the operating phase February 16, 2011 – Valencia (Spain)
  • 51. Hierarchical Text Categorization Hierarchical Text Categorization (HTC) deals with problems where categories are organized in the form of a hierarchy. D. Koller, M. Sahami. Hierarchically classifying documents using very few words. Proceedings of 14th International Conference on Machine Learning, pp. 170–178, 1997.
  • 52. HTC at a Glance  HTC studies how to improve the performance of classical text categorization techniques by exploiting the knowledge of the taxonomic relationships among classes
  • 53. Motivations  People organize large collections of documents in hierarchies of topics, or arrange a large body of knowledge in ontologies  The main goal of automatic text categorization is to deal with underlying taxonomies  A hierarchical approach can give benefits in real-world scenarios, characterized by information overload and imbalanced data February 16, 2011 – Valencia (Spain)
  • 54. HTC Approaches  Pachinko machine  At each level of the hierarchy  The classifier selects the single most probable category  It goes down the hierarchy inspecting only the children of the selected node  Probabilistic hierarchical local approach  At each level of the hierarchy  The classifier makes probabilistic decisions  It selects the leaf categories on the most probable paths S. Kiritchenko. Hierarchical text categorization and its application to bioinformatics. Ph.D. Thesis, University of Ottawa, Canada, 2006.
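The Pachinko-machine strategy on slide 54 can be illustrated with a minimal sketch. The toy taxonomy, the keyword sets, and the `keyword_score` function are assumptions made up for this example; a real system would plug in trained per-category classifiers instead.

```python
from typing import Callable, Dict, List

# Hypothetical toy taxonomy: parent category -> child categories.
TAXONOMY: Dict[str, List[str]] = {
    "root": ["biology", "computing"],
    "biology": ["genomics", "proteomics"],
    "computing": ["databases", "algorithms"],
}

# Hypothetical keyword sets standing in for trained classifiers.
KEYWORDS = {
    "biology": {"gene", "protein"},
    "computing": {"index", "query"},
    "genomics": {"gene", "genome"},
    "proteomics": {"protein", "fold"},
    "databases": {"index", "table"},
    "algorithms": {"sort", "graph"},
}


def keyword_score(doc: str, category: str) -> int:
    """Score a document by counting the category keywords it contains."""
    return len(set(doc.lower().split()) & KEYWORDS[category])


def pachinko_classify(
    doc: str,
    taxonomy: Dict[str, List[str]],
    score: Callable[[str, str], int],
    root: str = "root",
) -> List[str]:
    """Greedy top-down descent: keep only the best child at each level."""
    path: List[str] = []
    node = root
    while taxonomy.get(node):  # stop when the current node has no children
        node = max(taxonomy[node], key=lambda child: score(doc, child))
        path.append(node)
    return path


result = pachinko_classify("protein fold prediction", TAXONOMY, keyword_score)
print(result)
```

Because only one child survives per level, an early mistake cannot be recovered further down; the probabilistic local approach on the same slide avoids this by keeping several candidate paths.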
  • 55. HTC Approaches  Local classifier per node  Each classifier decides whether to forward the document to its children  Local classifier per parent node  Each classifier decides to which subtree(s) the document should be sent  Local classifier per level  The number of outputs per level grows while going down through the taxonomy  Global classifier  One classifier is trained, able to discriminate among all categories C.J. Silla and A. Freitas. A survey on hierarchical classification across different application domains. Journal of Data Mining and Knowledge Discovery, 2(1-2), pp. 31-72, 2010.
  • 56. Progressive Filtering  Progressive Filtering (PF) is a simple categorization technique that operates on hierarchically structured categories  A way to implement PF consists of decomposing a given rooted taxonomy into pipelines, one for each path that exists between the root and each node of the taxonomy  Each node is a binary classifier able to recognize whether or not an input belongs to the corresponding class  A threshold selection algorithm (TSA) can be run to identify an optimal, or sub-optimal, combination of thresholds for each pipeline A. Addis, G. Armano, E. Vargiu. Assessing Progressive Filtering to Perform Hierarchical Text Categorization in Presence of Input Imbalance. Proceedings of International Conference on Knowledge Discovery and Information Retrieval, pp. 14-23, 2010.
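The decomposition into pipelines described on slide 56 can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the taxonomy is a made-up example, the root is assumed not to carry a classifier of its own, and classifier outputs are represented by a hypothetical dictionary of precomputed scores.

```python
from typing import Dict, List

# Hypothetical rooted taxonomy: parent -> children.
TAXONOMY: Dict[str, List[str]] = {
    "root": ["A", "B"],
    "A": ["A1", "A2"],
}


def pipelines(taxonomy: Dict[str, List[str]], root: str) -> List[List[str]]:
    """Return one pipeline (root-to-node path, root excluded) per node."""
    found: List[List[str]] = []

    def walk(node: str, prefix: List[str]) -> None:
        for child in taxonomy.get(node, []):
            path = prefix + [child]
            found.append(path)
            walk(child, path)

    walk(root, [])
    return found


def pipeline_accepts(
    scores: Dict[str, float], pipeline: List[str], thresholds: List[float]
) -> bool:
    """A document passes a pipeline only if every binary classifier along
    the path scores at or above the threshold set for it in this pipeline."""
    return all(scores[c] >= t for c, t in zip(pipeline, thresholds))


all_pipelines = pipelines(TAXONOMY, "root")
doc_scores = {"A": 0.8, "A1": 0.4}  # hypothetical classifier outputs
print(all_pipelines)
print(pipeline_accepts(doc_scores, ["A", "A1"], [0.5, 0.3]))
```

Thresholds are passed per pipeline rather than per classifier, so the same classifier can behave differently depending on the pipeline in which it is embedded.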
  • 57. PF at a Glance  Starting from the root, each input traverses the taxonomy as a “token” February 16, 2011 – Valencia (Spain)
  • 58. Classifiers in PF  Partitioning the taxonomy in pipelines gives rise to a set of new classifiers, each represented by a pipeline February 16, 2011 – Valencia (Spain)
  • 59. Classifiers in PF February 16, 2011 – Valencia (Spain)
  • 60. Classifiers in PF  The same classifier may have different behaviours, depending on the pipeline in which it is embedded  Each pipeline can be considered in isolation from the others
  • 61. Threshold Selection in PF  A relevant problem is how to calibrate the threshold of the binary classifiers embedded by each pipeline in order to optimize the pipeline behaviour  Searching for an optimal or sub-optimal combination of thresholds in a pipeline can be viewed as the problem of finding a maximum in a utility function F that depends on the corresponding threshold vector θ
  • 62. TSA  For each pipeline the best combination of thresholds is calculated according to a bottom up algorithm that uses two functions  Repair which increases/decreases (↑ / ↓) the threshold until the utility function reaches a maximum  Calibrate which recursively operates downward from the given classifier by repeatedly calling repair (↑ / ↓) A. Addis, G. Armano, E. Vargiu. A comparative experimental assessment of a threshold selection algorithm in hierarchical text categorization. In: Advances in Information Retrieval. The 33rd European Conference on Information Retrieval (ECIR 2011), 2011.
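The idea of searching for a maximum of the utility function F over the threshold vector θ can be sketched with a simple coordinate-wise hill climb. This is only loosely inspired by the repair/calibrate description on slide 62 and should not be read as the published TSA: the step size, the bottom-up loop, and the toy utility function are all assumptions introduced for illustration.

```python
from typing import Callable, List

Utility = Callable[[List[float]], float]


def repair(
    thresholds: List[float], k: int, utility: Utility, step: float, direction: int
) -> List[float]:
    """Move thresholds[k] up (+1) or down (-1) while the utility improves."""
    best = list(thresholds)
    best_u = utility(best)
    while True:
        candidate = list(best)
        candidate[k] += direction * step
        if not 0.0 <= candidate[k] <= 1.0:
            break  # stay inside the valid threshold range
        u = utility(candidate)
        if u <= best_u:
            break  # no further improvement in this direction
        best, best_u = candidate, u
    return best


def calibrate(thresholds: List[float], utility: Utility, step: float = 0.1) -> List[float]:
    """Simplified bottom-up pass: adjust the deepest classifier first."""
    current = list(thresholds)
    for k in range(len(current) - 1, -1, -1):
        current = repair(current, k, utility, step, +1)
        current = repair(current, k, utility, step, -1)
    return current


def toy_utility(theta: List[float]) -> float:
    """Hypothetical utility with a single maximum at theta = (0.3, 0.7)."""
    return -((theta[0] - 0.3) ** 2 + (theta[1] - 0.7) ** 2)


result = calibrate([0.5, 0.5], toy_utility, step=0.1)
print([round(t, 2) for t in result])
```

In practice the utility would be a measure such as F1 computed on a validation set for the pipeline at hand, and a single pass like this one only guarantees a local (sub-optimal) solution.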
  • 63. TSA: An Example February 16, 2011 – Valencia (Spain)
  • 64. The Prototype  MultiAgent Architecture  X.MAS  Agent Framework  JADE A. Addis, G. Armano, E. Vargiu. From a Generic Multiagent Architecture to Multiagent Information Retrieval Systems. In: AT2AI-6, Sixth International Workshop, From Agent Theory to Agent Implementation, pp. 3–9, 2008. F. Bellifemine, G. Caire, D. Greenwood. Developing Multi-Agent Systems with JADE (Wiley Series in Agent Technology). John Wiley and Sons, 2007.
  • 65. X.MAS at a Glance  Macro-architecture February 16, 2011 – Valencia (Spain)
  • 66. X.MAS at a Glance  Micro-architecture [Figure labels: Information Agent (Scheduler, Source); Middle Agent (Scheduler, Dispatcher); Filter Agent (Scheduler, Actuator); Middle Agent (Scheduler, Dispatcher); Task Agent (Scheduler, Actuator); Middle Agent (Scheduler, Dispatcher); Interface Agent (Scheduler)]
  • 67. Pub.MAS February 16, 2011 – Valencia (Spain)
  • 68. Pub.MAS G. Armano, A. Manconi, and E. Vargiu. A MultiAgent System for Retrieving Bioinformatics Publications from Web Sources. IEEE Transactions on Nanobioscience, Special Session on GRID, Web Services, Software Agents and Ontology Applications for Life Science, 6(2), pp. 104-109, 2007. February 16, 2011 – Valencia (Spain)
  • 69. Information Extraction  It is supported by a set of agents explicitly devoted to  wrap the selected information sources  encode the extracted documents  An information agent wraps BMC Bioinformatics web site  HTML wrapper  An information agent wraps PubMed Central digital archive  Web service wrapper February 16, 2011 – Valencia (Spain)
  • 70. Hierarchical Text Categorization  The PF approach previously described has been implemented  Documents have been encoded to  remove all non-informative words  remove the most common morphological and inflexional suffixes  select the relevant features  generate a feature vector for each document  Classification is performed by wkNN classifiers  the score is assigned using non-parametric density estimation of the “a posteriori” probability
  • 71. The Adopted Taxonomy P. G. Baker, C. A. Goble, S. Bechhofer, N. W. Paton, R. Stevens, and A. Brass. An ontology for bioinformatics applications, Bioinformatics, 15(6), pp. 510–520, 1999.
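The encoding and scoring steps listed on slide 70 can be sketched as below. The stop-word list, the crude suffix stripper (a stand-in for a real stemmer), the cosine similarity, and the similarity-weighted vote are all assumptions chosen for brevity; the slides do not specify which stemmer, feature selection method, or weighting scheme the system uses.

```python
import math
from collections import Counter
from typing import List, Tuple

STOPWORDS = {"the", "of", "a", "and", "in", "is", "to"}  # hypothetical list
SUFFIXES = ("ing", "es", "s", "ed")  # crude stand-in for a real stemmer


def encode(text: str) -> Counter:
    """Drop stop words, strip common suffixes, return a term-count vector."""
    stems = []
    for token in text.lower().split():
        if token in STOPWORDS:
            continue
        for suffix in SUFFIXES:
            if token.endswith(suffix) and len(token) > len(suffix) + 2:
                token = token[: -len(suffix)]
                break
        stems.append(token)
    return Counter(stems)


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def wknn_score(doc: Counter, training: List[Tuple[Counter, bool]], k: int = 3) -> float:
    """Similarity-weighted share of positive examples among the k nearest
    neighbours, used here as a rough estimate of P(class | document)."""
    ranked = sorted(
        ((cosine(doc, vec), label) for vec, label in training),
        key=lambda pair: pair[0],
        reverse=True,
    )[:k]
    total = sum(sim for sim, _ in ranked)
    if total == 0:
        return 0.0
    return sum(sim for sim, label in ranked if label) / total


training = [
    (encode("sequence alignment"), True),
    (encode("protein folding"), False),
]
score = wknn_score(encode("alignment of sequences"), training, k=2)
print(score)
```

The returned score lies in [0, 1], so it can be compared directly against a per-pipeline threshold.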
  • 72. The Adopted Taxonomy February 16, 2011 – Valencia (Spain)
  • 73. The Adopted Taxonomy February 16, 2011 – Valencia (Spain)
  • 74. Users' Feedback  User feedback is aimed at dealing with any feedback provided by the user  Two solutions have been tested  training an ANN  using a kNN classifier
  • 75. Experiments  Different kinds of tests have been performed, each aimed at highlighting a specific issue  we estimated the (normalized) confusion matrix for each classifier belonging to the highest level of the taxonomy  we studied the impact of taking into account pipelines of classifiers, also trying to assess whether a residual independence was in fact present  we assessed the solution devised for implementing user’s feedback, based on the k-NN technique February 16, 2011 – Valencia (Spain)
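As an illustration of the first test on slide 75, a normalized confusion matrix for a single binary classifier can be computed as in the sketch below. Row normalization (dividing by the number of examples of each true class) is an assumption; the slides do not state which normalization was used, and the labels are made-up data.

```python
from typing import Dict, List, Tuple


def normalized_confusion(
    true_labels: List[bool], predicted_labels: List[bool]
) -> Dict[Tuple[bool, bool], float]:
    """Return {(true, predicted): rate}, with each true-class row summing to 1."""
    counts = {(t, p): 0 for t in (True, False) for p in (True, False)}
    for t, p in zip(true_labels, predicted_labels):
        counts[(t, p)] += 1

    matrix: Dict[Tuple[bool, bool], float] = {}
    for t in (True, False):
        row_total = counts[(t, True)] + counts[(t, False)]
        for p in (True, False):
            matrix[(t, p)] = counts[(t, p)] / row_total if row_total else 0.0
    return matrix


# Hypothetical labels for four test documents.
matrix = normalized_confusion(
    [True, True, True, False],
    [True, True, False, False],
)
print(matrix)
```

With this normalization, `matrix[(True, True)]` is the recall of the classifier and `matrix[(False, True)]` its false-positive rate.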
  • 76. Experiments  Tests have been performed using selected publications extracted from the BMC Bioinformatics site and from the PubMed Central digital archive  Publications have been classified by an expert of the domain according to the proposed taxonomy  For each item of the taxonomy, a set of about 100-150 articles has been selected to train the corresponding wk-NN classifier, and 300-400 articles have been used to test it February 16, 2011 – Valencia (Spain)
  • 77. Conclusions February 16, 2011 – Valencia (Spain)
  • 78. Conclusions  Bioinformatics needs suitable, automated, and “intelligent” solutions to acquire, analyse, organize, and store biological data  IR might be very useful in tackling bioinformatics problems  Currently, few IR techniques have been adopted to solve some bioinformatics tasks  A system aimed at retrieving and filtering bioinformatics publications has been presented as case study  We argue that further investigations and experiments could be made to exploit IR in bioinformatics
  • 79. Acknowledgments  This work was partially supported by the Italian Ministry of Education – Investment funds for basic research, under the project ITALBIONET – Italian Network of Bioinformatics  I wish to thank all the IASC Group members for their valuable help  IASC Group members are:  G. Armano – head  A. Addis, F. Mascia and E. Vargiu – PhD, Post Doc  A. Giuliani, N. Hatami, M. Javarone and F. Ledda – PhD students  S. Curatti – collaborator, programmer  I wish to thank also Andrea Manconi for his suggestions February 16, 2011 – Valencia (Spain)
  • 80. Thanks for your attention! Contact: Eloisa Vargiu vargiu@diee.unica.it February 16, 2011 – Valencia (Spain)