Bioinformatics Meets Information Retrieval: State of the Art and a Case Study

Held at Universitat Politecnica de Valencia (invited by Prof. O. Pastor). Valencia (Spain), February 16, 2011

Published in: Technology, Education

Transcript of "Bioinformatics Meets Information Retrieval: State of the Art and a Case Study"

  1. Bioinformatics Meets Information Retrieval: State of the Art and a Case Study. Eloisa Vargiu, Intelligent Agents and Soft-Computing Group, Dept. of Electrical and Electronic Engineering, University of Cagliari, Italy. Email: vargiu@diee.unica.it. February 16, 2011 – Valencia (Spain).
  2. My Background. 2000 – 2004: Automatic planning (classic domains: HW[]; dynamic domains: HIPE). 2004 – 2009: Bioinformatics (protein secondary structure prediction: MASSP3 and GAME/SSP). 2000 – …: Multiagent systems (a Personalized Adaptive and Cooperative Multiagent System: PACMAS; a generic architecture to perform information retrieval tasks: X.MAS). 2006 – …: Information Retrieval (hierarchical text categorization: PF and TSA; recommender systems and contextual advertising: ConCA).
  3. Outline. Context and Mission; Why Bioinformatics Needs Information Retrieval; Bioinformatics Meets Information Retrieval; Case Study: Retrieving and Filtering Bioinformatics Publications; Conclusions.
  4. Context and Mission
  5. Web Evolution. Web 1.0 (1993): source of information; personal homepages. Web 2.0 (2004): social networks; (micro)blogging. Web 3.0 (2005): Semantic Web; web composition.
  6. Web Evolution and Bioinformatics. A long time ago: data was stored in local DBs; data was shared as flat files; biologists worked alone or in small groups.
  7. Web Evolution and Bioinformatics. Today, online repositories: the major sources of nucleotide sequences are those belonging to the International Nucleotide Sequence Database Collaboration, namely DDBJ (DNA Data Bank of Japan), EMBL (European Molecular Biology Laboratory), and GenBank (NIH genetic sequence database). Web services: basic bioinformatics services are classified by the EBI into three categories, namely SSS (Sequence Search Services), MSA (Multiple Sequence Alignment), and BSA (Biological Sequence Analysis).
  8. Web Evolution and Scientific Publications. A long time ago: publications were consulted at the library; just two or three relevant journals were available; relevant publications were selected manually.
  9. Web Evolution and Scientific Publications. Today: online journals; online conference proceedings; publications are often available for free; manual selection of relevant publications becomes unfeasible.
  10. As a Consequence... Unstructured information; information overload; personalized information selection and input imbalance.
  11. Our Mission. To cope with: unstructured information, by classifying documents according to a given taxonomy; information overload, by filtering information to reduce redundancy; personalized information selection and input imbalance, by filtering information according to user preferences. Case study: retrieving and filtering bioinformatics publications.
  12. Research Topics. Information Retrieval; Bioinformatics.
  13. Information Retrieval. "Information Retrieval (IR) deals with the representation, storage, organization of, and access to information items. The user must first translate this information need into a query which can be processed by an IR system. Given the user query, the key goal of an IR system is to retrieve information which might be useful or relevant to the user." R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. New York: Addison-Wesley, 1999.
  14. Main IR Topics. Indexing; Search and Web Search; Information Filtering; Text Mining; Text Categorization and Hierarchical Text Categorization.
  15. Bioinformatics. "Bioinformatics is the field of science in which biology, computer science, and information technology merge to form a single discipline. The ultimate goal of the field is to enable the discovery of new biological insights as well as to create a global perspective from which unifying principles in biology can be discerned." National Center for Biotechnology Information (NCBI), http://www.ncbi.nlm.nih.gov/.
  16. Main Bioinformatics Research Areas. Sequence analysis; genome annotation; computational evolutionary biology; analysis of gene expression; analysis of protein expression; analysis of mutations in cancer; comparative genomics; modelling biological systems; prediction of protein structure; molecular interaction.
  17. Why Bioinformatics Needs Information Retrieval
  18. Does Bioinformatics Need IR? Bioinformatics is concerned with researching, developing and applying tools and methods to acquire, analyse, organize and store biological and medical data. Indexing and search techniques may help in the task of acquiring data. Information filtering, text mining and text categorization techniques may be useful for the analysis of data. Text categorization, with particular reference to hierarchical text categorization, may be used in the organization and storage tasks.
  19. Bioinformatics Data. A huge amount of data to be: indexed; searched for in large databases or on the web; filtered according to users' preferences; text mined; categorized according to its textual content.
  20. DB Indexing. Why: data types are relegated to blob and unstructured text fields; there are few results in building persistent access paths to support fast retrieval methods; genomic datasets in public repositories are annotated with free-text fields describing the pathological state of the studied sample; annotations are not mapped to concepts in any ontology.
  21. DB Indexing. Who: MoBIoS – Molecular Biological Information System. What: a specialized database management system; the storage manager is based on metric-space indexing; the query language entails biological data types. Where: sequence homology (local alignment and mutations). D. Miranker, W. Xu, and R. Mao. MoBIoS: a Metric-Space DBMS to Support Biological Discovery. Proceedings of the International Conference on Scientific and Statistical Database Management Systems, 2003.
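To make the idea of metric-space indexing concrete, here is a minimal sketch. It is not MoBIoS code: the `PivotIndex` class, the single-pivot design, and the use of edit distance are all assumptions chosen for illustration. It shows how precomputed distances to a pivot, combined with the triangle inequality, let a range query skip items without computing their distance to the query.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, which is a true metric on strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


class PivotIndex:
    """Toy metric-space index (hypothetical): stores each item's distance to one pivot."""

    def __init__(self, items, pivot):
        self.pivot = pivot
        self.entries = [(item, edit_distance(item, pivot)) for item in items]

    def range_query(self, query, radius):
        """Return (matches, number of distance computations performed)."""
        d_query_pivot = edit_distance(query, self.pivot)
        computed = 1
        matches = []
        for item, d_item_pivot in self.entries:
            # Triangle inequality: d(query, item) >= |d(query, pivot) - d(item, pivot)|.
            if abs(d_query_pivot - d_item_pivot) > radius:
                continue  # pruned without computing d(query, item)
            computed += 1
            if edit_distance(query, item) <= radius:
                matches.append(item)
        return matches, computed


index = PivotIndex(["ACGT", "ACGA", "TTTTTTTT", "ACGTACGTACGT"], pivot="ACGT")
hits, computed = index.range_query("ACGG", radius=1)
print(hits, computed)
```

In this toy data set the two long sequences are discarded by the pivot test alone, so only two of the four stored items need a full distance computation.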
  22. DB Indexing. Who: --. What: ontology-driven indexing of public datasets for translational bioinformatics; methods to map text annotations of gene expression datasets to concepts in the UMLS. Where: Gene Expression Omnibus; Stanford Tissue Microarray Database. N.H. Shah, C. Jonquet, A.P. Chiang, A.J. Butte, R. Chen, and M.A. Musen. Ontology-driven indexing of public datasets for translational bioinformatics. BMC Bioinformatics, 10(Suppl 2):S1, 2009.
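As a rough illustration of mapping free-text annotations to ontology concepts, the sketch below uses plain dictionary matching. It is not the method of Shah et al.: the concept identifiers, the term lists, and the dataset names are hypothetical placeholders, and a real system would load its vocabulary from an ontology such as the UMLS.

```python
import re

# Hypothetical mini-vocabulary: concept identifier -> surface terms.
CONCEPT_TERMS = {
    "CONCEPT:breast_carcinoma": ["breast cancer", "breast carcinoma"],
    "CONCEPT:asthma": ["asthma"],
}


def map_annotation(text, concept_terms=CONCEPT_TERMS):
    """Return the sorted concept identifiers whose terms occur in the text."""
    normalized = re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()
    padded = f" {normalized} "  # padding makes whole-word matching simple
    found = set()
    for concept, terms in concept_terms.items():
        if any(f" {term} " in padded for term in terms):
            found.add(concept)
    return sorted(found)


def build_index(datasets):
    """Build a concept -> dataset identifiers index from free-text annotations."""
    index = {}
    for dataset_id, annotation in datasets.items():
        for concept in map_annotation(annotation):
            index.setdefault(concept, []).append(dataset_id)
    return index


# Hypothetical datasets with free-text sample descriptions.
datasets = {
    "dataset_1": "Primary breast carcinoma, untreated",
    "dataset_2": "Airway epithelium from patients with asthma.",
    "dataset_3": "Healthy control",
}
concept_index = build_index(datasets)
print(concept_index)
```

Once annotations are mapped to concepts, datasets can be retrieved by concept rather than by the exact wording of their free-text fields.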
  23. Web Indexing. Why: sequence retrieval tools and sequence analysis tools are most often separate; the usage of sequence DBs is often general and limited to keyword searching and entry retrieval; discovering and accessing the appropriate bioinformatics resource for a specific task has become increasingly important.
  24. Web Indexing. Who: SIRW – A Web Server for the Simple Indexing and Retrieval System. What: a WWW interface to the Simple Indexing and Retrieval (SIR) system to parse and index flat-file DBs; a framework for doing sequence analysis on selected biological sequences. Where: sequence analysis (motif pattern searches). C. Ramu. SIRW: a web server for the Simple Indexing and Retrieval System that combines sequence motif searches with keyword searches. Nucleic Acids Research, 31(13), pp. 3771-3774, 2003.
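The combination of keyword search and motif search described above can be sketched as follows. This is not SIRW or SIR code: the record layout, the function names, and the use of Python regular expressions as motif patterns are assumptions made for the example.

```python
import re
from collections import defaultdict

# Hypothetical flat-file-like records: (identifier, description, sequence).
RECORDS = [
    ("seq1", "putative zinc finger protein", "MKCAACGKAFHLL"),
    ("seq2", "zinc transporter", "MSTLLPAAGG"),
    ("seq3", "kinase domain protein", "MKCPECGKSFHQQ"),
]


def build_keyword_index(records):
    """Inverted index: lowercase word -> set of record identifiers."""
    index = defaultdict(set)
    for identifier, description, _sequence in records:
        for word in re.findall(r"[a-z0-9]+", description.lower()):
            index[word].add(identifier)
    return index


def search(records, index, keyword, motif):
    """Identifiers that match the keyword and whose sequence contains the motif."""
    candidates = index.get(keyword.lower(), set())
    pattern = re.compile(motif)
    return sorted(
        identifier
        for identifier, _description, sequence in records
        if identifier in candidates and pattern.search(sequence)
    )


keyword_index = build_keyword_index(RECORDS)
print(search(RECORDS, keyword_index, "zinc", r"C..C"))
print(search(RECORDS, keyword_index, "protein", r"C..C"))
```

The keyword index narrows the candidate set first, so the motif scan only runs on entries that already match the textual query.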
  25. Web Indexing. Who: BIRI – BIoinformatics Resource Inventory. What: an approach for automatically discovering and indexing public bioinformatics resources. Where: the scientific literature. G. de la Calle, M. García-Remesal, S. Chiesa, D. de la Iglesia, V. Maojo. BIRI: a new approach for automatically discovering and indexing available public bioinformatics resources from the literature. BMC Bioinformatics, Oct 7;10:320, 2009.
  26. DB Search. Why: a wealth of bioinformatics tools and databases has been created over the last decade and most are freely available; it is often desirable to visualize the database hits stacked according to the query sequence; there is no inventory presenting an up-to-date and easily searchable index of all these resources.
  27. DB Search. Who: MView – Multiple alignment Viewer. What: a tool for converting the result of a sequence database search into the form of a coloured multiple alignment of hits stacked against the query. Where: multiple alignment. N.P. Brown, C. Leroy, and C. Sander. MView: a web-compatible database search or multiple alignment viewer. Bioinformatics, 14(4), pp. 380-381, 1998.
  28. DB Search. Who: BioWareDB. What: an extensive and current catalog of software and DBs of relevance to researchers in the field of biology and medicine. Where: current and available biomedical computing resources. M.W. Matthiessen. BioWareDB: the biomedical software and database search engine. Bioinformatics, 19(17), pp. 2319-2320, 2003.
  29. Web Search. Why: today, scientists can easily post their research findings on the Web or compare their discoveries with previous work; manually maintaining a wrapper library will not scale to accommodate the growth of genomics data sources on the Web.
  30. Web Search. Who: --. What: an automated system able to find, classify, and wrap new sources without constant human intervention. Where: distributed genomics data sources. D. Rocco and T. Critchlow. Automatic discovery and classification of bioinformatics Web sources. Bioinformatics, 19(15), pp. 1927-1933, 2003.
  31. Web Search. Who: GoPubMed. What: an ontology-based literature search applied to Gene Ontology (GO) and PubMed. Where: scientific literature. R. Delfs, A. Doms, A. Kozlenkov, and M. Schroeder. GoPubMed: ontology-based literature search applied to gene ontology and PubMed. In Proceedings of German Bioinformatics Conference, pp. 169–178, 2004.
  32. Information Filtering. Why: in the Web 2.0 scenario, users look for collaborative environments in which they can meet other users with similar preferences and needs; researchers need to search for and/or generate specialized datasets that meet specific requirements.
  33. Information Filtering. Who: ProDaMa-C, Protein Dataset Management – Collaborative. What: a web application aimed at generating specialized protein structure datasets and favouring collaboration among researchers. Where: protein structures. G. Armano and A. Manconi. A Collaborative Web Application for Supporting Researchers in the Task of Generating Protein Datasets. Advances in Distributed Agent-based Retrieval Tools, V. Pallotta, A. Soro, E. Vargiu (eds.), Springer-Verlag, 2011.
  34. Information Filtering. Who: Gene Recommender. What: an algorithm that ranks genes according to how strongly they correlate with a set of query genes. Where: analysis of gene expression. A.B. Owen, J. Stuart, K. Mach, A.M. Villeneuve, S. Kim. A gene recommender algorithm to identify coexpressed genes. Genome Research, Aug;13(8), pp. 1828-37, 2003.
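Ranking genes by correlation with a query set can be sketched in a few lines. This is a deliberately simplified stand-in, not the algorithm of Owen et al.: it scores each candidate gene by its mean Pearson correlation with the query genes, and the gene names and expression values are invented. It also assumes that no expression profile is constant, since that would make the correlation undefined.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equally long, non-constant profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def rank_genes(expression, query_genes):
    """Rank non-query genes by mean correlation with the query genes, best first."""
    scores = {}
    for gene, profile in expression.items():
        if gene in query_genes:
            continue
        correlations = [pearson(profile, expression[q]) for q in query_genes]
        scores[gene] = sum(correlations) / len(correlations)
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical expression profiles over four experiments.
expression = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.0],
    "geneC": [1.0, 2.1, 2.9, 4.2],
    "geneD": [4.0, 3.0, 2.0, 1.0],
    "geneE": [1.0, 3.0, 2.0, 4.0],
}
ranking = rank_genes(expression, ["geneA", "geneB"])
print(ranking)
```

Genes whose profiles rise and fall with the query genes come first, while the anti-correlated `geneD` ends up last.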
  35. Text Mining. Why: web-based tools capable of filtering public DBs are increasingly required; interesting and useful information, relevant to researchers, could appear in documents (e.g., papers) they have not read and therefore be missed entirely; a reliable means of distinguishing true hits from false hits is of paramount importance to DB search methods; biologists construct a pathway by reading a large number of articles and interpreting them as a consistent network, but the link to the original article is lost.
  36. Text Mining. Who: MedMiner. What: an Internet text mining tool that filters the literature and presents the most relevant portions in a well-organized way that facilitates understanding. Where: gene expression profiling. L. Tanabe, U. Scherf, L.H. Smith, J.K. Lee, L. Hunter, and J.N. Weinstein. MedMiner: an Internet Text-Mining Tool for Biomedical Information, with Application to Gene Expression Profiling. Biotechniques, Dec;27(6), pp. 1210-4, 1999.
  37. Text Mining. Who: BioRAT. What: a research assistant that, given a query, autonomously finds a set of papers, reads them, and highlights the most relevant facts in each. Where: scientific literature. D.P.A. Corney, B.F. Buxton, W.B. Langdon, and D.T. Jones. BioRAT: Extracting biological information from full-length papers. Bioinformatics, 20(17), pp. 3206–3213, 2004.
  38. Text Mining. Who: SAWTED – Structure Assignment With Text Description. What: an automated system for filtering DB hits. Where: homologue annotation. R.M. MacCallum, L.A. Kelley, and M.J. Sternberg. SAWTED: structure assignment with text description-enhanced detection of remote homologues with automated SWISS-PROT annotation comparisons. Bioinformatics, Feb;16(2), pp. 125-9, 2000.
  39. Text Mining. Who: PathText. What: a system that integrates a pathway visualizer, text mining systems and annotation tools into a seamless environment. Where: pathway visualizations. B. Kemper, T. Matsuzaki, Y. Matsuoka, Y. Tsuruoka, H. Kitano, S. Ananiadou, and J. Tsujii. PathText: a text mining integrator for biological pathway visualizations. Bioinformatics, 26(12), pp. i374-i381, 2010.
  40. Text Categorization. Why: information in text form, such as MEDLINE records, is a greatly underutilized source of biological information; individual researchers find it difficult to keep up with all the new, relevant information; systems that extract structured information from natural language passages have been highly successful in specialized domains; the time is ripe for developing such applications for molecular biology and genomics.
  41. Text Categorization. Who: --. What: constructing biological knowledge bases by extracting information from text sources. Where: MEDLINE. M. Craven and J. Kumlien. Constructing Biological Knowledge Bases by Extracting Information from Text Sources. In Proceedings of the 7th International Conference on Intelligent Systems for Molecular Biology, 1999.
  42. Text Categorization. Who: Genies. What: a natural-language processing system for the extraction of molecular pathways. Where: scientific publications. C. Friedman, P. Kra, H. Yu, M. Krauthammer, and A. Rzhetsky. Genies: a natural-language processing system for the extraction of molecular pathways from journal articles. Bioinformatics, 17, pp. 574–582, 2001.
  43. Hierarchical Text Categorization. Why: a great deal of genomics information accumulated over the years is available in online text repositories (such as MEDLINE); these resources still do not provide adequate mechanisms for retrieving the required information; traditional filtering techniques based on keyword search are often inadequate to express what the user is really searching for; web repositories, such as Medical Subject Headings (MeSH) in MEDLINE, encompass an underlying taxonomy.
  44. Hierarchical Text Categorization. Who: --. What: a tool for assisting biologists with literature search for the task of associating genes with Gene Ontology codes. Where: MEDLINE. S. Kiritchenko, S. Matwin, and A.F. Famili. Hierarchical text categorization as a tool of associating genes with gene ontology codes. In 2nd European Workshop on Data Mining and Text Mining for Bioinformatics, pp. 26–30, 2004.
  45. Hierarchical Text Categorization. Who: Pub.MAS. What: a multiagent system for retrieving and classifying publications. Where: BMC Bioinformatics; PubMed Central. G. Armano, A. Manconi, and E. Vargiu. A MultiAgent System for Retrieving Bioinformatics Publications from Web Sources. IEEE Transactions on Nanobioscience, Special Session on GRID, Web Services, Software Agents and Ontology Applications for Life Science, 6(2), pp. 104-109, 2007.
  46. Case Study: Retrieving and Filtering Bioinformatics Publications
  47. An IR Task (diagram). Online Repositories → Information Extraction (Wrapping Information Sources) → Extracted Data/Information → Text Categorization (Taxonomic Classification of Items) → Selected Data/Information → User Feedback (Adaptive Behavior).
  48. Information Extraction. Essential to retrieve documents provided by heterogeneous and distributed sources. A.H.F. Laender, B.A. Ribeiro-Neto, A.S. da Silva, J.S. Teixeira. A brief survey of web data extraction tools. SIGMOD Rec. 31(2), pp. 84–93, 2002.
  49. Text Categorization. It is the task of determining and assigning topical labels to content. Typical approaches to text categorization: statistical; semantic. In recent years several researchers have investigated the use of hierarchies for text categorization. F. Sebastiani. A tutorial on automated text categorisation. Proceedings of ASAI-99, 1st Argentinian Symposium on Artificial Intelligence, pp. 7-35, 1999.
  50. User Feedback. It is aimed at dealing with any feedback provided by the user. In semiautomated classification and adaptive filtering, we may expect the user of a classifier to provide feedback on how test documents have been classified. In this case, further training may be performed during the operating phase.
  51. Hierarchical Text Categorization. "Hierarchical Text Categorization (HTC) deals with problems where categories are organized in the form of a hierarchy." D. Koller, M. Sahami. Hierarchically classifying documents using very few words. Proceedings of 14th International Conference on Machine Learning, pp. 170–178, 1997.
  52. HTC at a Glance. HTC studies how to improve the performance of classical text categorization techniques by exploiting knowledge of the taxonomic relationships among classes.
  53. Motivations. People organize large collections of documents in hierarchies of topics, or arrange a large body of knowledge in ontologies. The main goal of automatic text categorization is to deal with underlying taxonomies. A hierarchical approach can give benefits in real-world scenarios, characterized by information overload and imbalanced data.
  54. HTC Approaches. Pachinko machine: at each level of the hierarchy, the classifier selects the single most probable category; it goes down the hierarchy inspecting only the children of the selected nodes. Probabilistic hierarchical local approach: at each level of the hierarchy, the classifier makes probabilistic decisions; it selects the leaf categories on the most probable paths. S. Kiritchenko. Hierarchical text categorization and its application to bioinformatics. Ph.D. Thesis, University of Ottawa, Canada, 2006.
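The pachinko-machine strategy can be sketched as a greedy top-down descent. The taxonomy, the keyword lists, and the overlap-based `score` function below are hypothetical stand-ins for trained per-node classifiers; only the control flow is meant to reflect the approach.

```python
# Hypothetical taxonomy: parent -> children.
TAXONOMY = {
    "root": ["biology", "computing"],
    "biology": ["genomics", "proteomics"],
    "computing": ["databases", "algorithms"],
}

# Hypothetical keyword lists standing in for trained per-node classifiers.
KEYWORDS = {
    "biology": {"gene", "protein", "cell"},
    "computing": {"index", "query", "algorithm"},
    "genomics": {"gene", "genome", "sequence"},
    "proteomics": {"protein", "structure", "fold"},
    "databases": {"index", "query", "storage"},
    "algorithms": {"algorithm", "complexity"},
}


def score(category, words):
    """Toy relevance score: number of category keywords found in the document."""
    return len(KEYWORDS[category] & words)


def pachinko_classify(text, taxonomy=TAXONOMY, root="root"):
    """Greedy top-down descent: keep only the best-scoring child at each level."""
    words = set(text.lower().split())
    path = []
    node = root
    while node in taxonomy:  # stop when a leaf is reached
        node = max(taxonomy[node], key=lambda child: score(child, words))
        path.append(node)
    return path


path = pachinko_classify("protein structure and fold prediction")
print(path)
```

Because only one child is followed at each level, a wrong decision near the root cannot be recovered further down, which is the usual criticism of this strategy.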
  55. HTC Approaches. Local classifier per node: each classifier decides whether to forward the document to its children. Local classifier per parent node: each classifier decides to which subtree(s) the document should be sent. Local classifier per level: the number of outputs per level grows while going down through the taxonomy. Global classifier: one classifier is trained, able to discriminate among all categories. C.J. Silla and A. Freitas. A survey on hierarchical classification across different application domains. Journal of Data Mining and Knowledge Discovery, 2(1-2), pp. 31-72, 2010.
  56. Progressive Filtering. Progressive Filtering (PF) is a simple categorization technique that operates on hierarchically structured categories. A way to implement PF consists of decomposing a given rooted taxonomy into pipelines, one for each path that exists between the root and each node of the taxonomy. Each node is a binary classifier able to recognize whether or not an input belongs to the corresponding class. A threshold selection algorithm (TSA) can be run to identify an optimal, or sub-optimal, combination of thresholds for each pipeline. A. Addis, G. Armano, E. Vargiu. Assessing Progressive Filtering to Perform Hierarchical Text Categorization in Presence of Input Imbalance. Proceedings of International Conference on Knowledge Discovery and Information Retrieval, pp. 14-23, 2010.
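The decomposition of a taxonomy into pipelines can be sketched as follows. This is an illustration under assumptions, not the authors' implementation: the taxonomy, scores, and thresholds are invented, the root is treated as accepting everything (so pipelines start at its children), and thresholds are stored per node for brevity even though PF allows the same classifier to be tuned differently in each pipeline.

```python
# Hypothetical rooted taxonomy: parent -> children.
TAXONOMY = {
    "root": ["A", "B"],
    "A": ["A1", "A2"],
}


def pipelines(taxonomy, root="root"):
    """One pipeline per non-root node: the chain of categories leading to it."""
    result = []

    def visit(node, path):
        for child in taxonomy.get(node, []):
            child_path = path + [child]
            result.append(child_path)
            visit(child, child_path)

    visit(root, [])
    return result


def accepts(pipeline, scores, thresholds):
    """A document passes a pipeline only if every classifier along it fires."""
    return all(scores[node] >= thresholds[node] for node in pipeline)


all_pipelines = pipelines(TAXONOMY)

# Hypothetical classifier outputs for one document, and uniform thresholds.
scores = {"A": 0.9, "B": 0.2, "A1": 0.7, "A2": 0.4}
thresholds = {"A": 0.5, "B": 0.5, "A1": 0.5, "A2": 0.5}
accepted = [p for p in all_pipelines if accepts(p, scores, thresholds)]
print(all_pipelines)
print(accepted)
```

The document is filtered progressively: it reaches `A1` because both `A` and `A1` accept it, while the pipelines ending in `A2` and `B` reject it.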
  57. 57. PF at a Glance  Starting from the root, each input traverses the taxonomy as a “token”
  58. 58. Classifiers in PF  Partitioning the taxonomy in pipelines gives rise to a set of new classifiers, each represented by a pipeline
  59. 59. Classifiers in PF
  60. 60. Classifiers in PF  The same classifier may have different behaviours, depending on which pipeline it is embedded  Each pipeline can be considered in isolation from the othersFebruary 16, 2011 – Valencia (Spain)
  61. 61. Threshold Selection in PF  A relevant problem is how to calibrate the thresholds of the binary classifiers embedded in each pipeline in order to optimize the pipeline behaviour  Searching for an optimal or sub-optimal combination of thresholds in a pipeline can be viewed as the problem of finding a maximum of a utility function F that depends on the corresponding threshold vector θ
  62. 62. TSA  For each pipeline the best combination of thresholds is calculated according to a bottom up algorithm that uses two functions  Repair which increases/decreases (↑ / ↓ the threshold until the ) utility function reaches a maximum  Calibrate which recursively operates downward from the given classifier by repeatedly calling repair (↑ / ↓) A. Addis, G. Armano, E. Vargiu. A comparative experimental assessment of a threshold selection algorithm in hierarchical text categorization. In: Advances in Information Retrieval. The 33rd European Conference on Information Retrieval (ECIR 2011), 2011February 16, 2011 – Valencia (Spain)
  63. 63. TSA: An Example
  64. 64. The Prototype  MultiAgent Architecture  X.MAS  Agent Framework  JADE A. Addis, G. Armano, E. Vargiu. From a Generic Multiagent Architecture to Multiagent Information Retrieval Systems. In: AT2AI-6, Sixth International Workshop, From Agent Theory to Agent Implementation, pp. 3–9, 2008. F. Bellifemine, G. Caire,D. Greenwood. Developing Multi-Agent Systems with JADE (Wiley Series in Agent Technology). John Wiley and Sons, 2007.February 16, 2011 – Valencia (Spain)
  65. 65. X.MAS at a Glance  Macro-architectureFebruary 16, 2011 – Valencia (Spain)
  66. 66. X.MAS at a Glance  Micro-architecture  Information Agent: Scheduler, Source  Middle Agent: Scheduler, Dispatcher  Filter Agent: Scheduler, Actuator  Middle Agent: Scheduler, Dispatcher  Task Agent: Scheduler, Actuator  Middle Agent: Scheduler, Dispatcher  Interface Agent: Scheduler
  67. 67. Pub.MAS
  68. 68. Pub.MAS  G. Armano, A. Manconi, and E. Vargiu. A MultiAgent System for Retrieving Bioinformatics Publications from Web Sources. IEEE Transactions on Nanobioscience, Special Session on GRID, Web Services, Software Agents and Ontology Applications for Life Science, 6(2), pp. 104-109, 2007.
  69. 69. Information Extraction  It is supported by a set of agents explicitly devoted to  wrapping the selected information sources  encoding the extracted documents  An information agent wraps the BMC Bioinformatics web site  HTML wrapper  An information agent wraps the PubMed Central digital archive  Web service wrapper
  70. 70. Hierarchical Text Categorization  The PF approach previously described has been implemented  Each document is encoded to  remove all non-informative words  remove the most common morphological and inflexional suffixes  select the relevant features  generate a feature vector for each document  Classification is performed by wkNN classifiers  the score is assigned using a non-parametric density estimation of the “a posteriori” probability
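The encoding and scoring steps listed above can be illustrated with a small sketch. Everything here is an assumption made for the example: the stop-word list, the crude suffix stripping, cosine similarity, and the similarity-weighted vote are common choices for a weighted kNN, but the slide does not state which variants the prototype uses.

```python
import math
from collections import Counter
from typing import List, Tuple

STOP_WORDS = {"the", "of", "a", "an", "and", "in", "to", "for", "is"}  # illustrative only

def encode(text: str) -> Counter:
    """Lower-case, drop stop words, strip a few common suffixes, and count the terms."""
    terms = []
    for word in text.lower().split():
        word = word.strip(".,;:()")
        if not word or word in STOP_WORDS:
            continue
        for suffix in ("ing", "s"):  # crude stand-in for a real stemmer
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                word = word[: -len(suffix)]
                break
        terms.append(word)
    return Counter(terms)

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def wknn_score(doc: Counter, training: List[Tuple[Counter, bool]], k: int = 3) -> float:
    """Similarity-weighted share of positive examples among the k nearest neighbours."""
    scored = sorted(((cosine(doc, vec), label) for vec, label in training),
                    key=lambda pair: pair[0], reverse=True)[:k]
    total = sum(weight for weight, _ in scored)
    return sum(weight for weight, label in scored if label) / total if total else 0.0

# Hypothetical training set for a single "genomics" node classifier.
training_set = [
    (encode("gene sequencing and genome assembly"), True),
    (encode("protein gene expression"), True),
    (encode("database query indexing"), False),
    (encode("relational database schema"), False),
]
genomics_score = wknn_score(encode("genome sequencing of genes"), training_set)
database_score = wknn_score(encode("database schema query"), training_set)
```

The returned value lies between 0 and 1, so it can be compared directly with a per-node threshold inside a pipeline.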
  71. 71. The Adopted Taxonomy  P. G. Baker, C. A. Goble, S. Bechhofer, N. W. Paton, R. Stevens, and A. Brass. An ontology for bioinformatics applications, Bioinformatics, 15(6), pp. 510–520, 1999.
  72. 72. The Adopted Taxonomy
  73. 73. The Adopted Taxonomy
  74. 74. User Feedback  This component deals with any feedback provided by the user  Two solutions have been tested  training an ANN  using a kNN classifier
  75. 75. Experiments  Different kinds of tests have been performed, each aimed at highlighting a specific issue  we estimated the (normalized) confusion matrix for each classifier belonging to the highest level of the taxonomy  we studied the impact of taking into account pipelines of classifiers, also trying to assess whether a residual independence was in fact present  we assessed the solution devised for implementing user’s feedback, based on the k-NN techniqueFebruary 16, 2011 – Valencia (Spain)
  76. 76. Experiments  Tests have been performed using selected publications extracted from the BMC Bioinformatics site and from the PubMed Central digital archive  Publications have been classified by an expert of the domain according to the proposed taxonomy  For each item of the taxonomy, a set of about 100-150 articles has been selected to train the corresponding wk-NN classifier, and 300-400 articles have been used to test itFebruary 16, 2011 – Valencia (Spain)
  77. 77. Conclusions
  78. 78. Conclusions  Bioinformatics needs suitable, automated, and “ intelligent” solutions to acquire, analyse, organize, and store biological data  IR might be very useful to face with bioinformatics problems  Currently, few IR techniques have been adopted to solve some bioinformatics tasks  A system aimed at retrieving and filtering bioinformatics publications has been presented as case study  We argue that further investigations and experiments could be made to exploit IR in bioinformaticsFebruary 16, 2011 – Valencia (Spain)
  79. 79. Acknowledgments  This work was partially supported by the Italian Ministry of Education – Investment funds for basic research, under the project ITALBIONET – Italian Network of Bioinformatics  I wish to thank all the IASC Group members for their valuable help  IASC Group members are:  G. Armano – head  A. Addis, F. Mascia and E. Vargiu – PhD, Post Doc  A. Giuliani, N. Hatami, M. Javarone and F. Ledda – PhD students  S. Curatti – collaborator, programmer  I wish to thank also Andrea Manconi for his suggestionsFebruary 16, 2011 – Valencia (Spain)
  80. 80. Thanks for your attention!  Contact: Eloisa Vargiu, vargiu@diee.unica.it
