This document discusses how bioinformatics research can benefit from techniques in information retrieval. It provides background on bioinformatics, information retrieval, and how the fields intersect. Specifically, it describes how indexing, searching, filtering, mining and categorizing large amounts of bioinformatics data and publications can help with tasks like acquiring, analyzing, organizing and storing biological information. The document also presents several case studies of specific tools and systems that apply IR techniques in bioinformatics.
Bioinformatics Meets Information Retrieval: State of the Art and a Case Study
1. Bioinformatics Meets
Information Retrieval
State of the Art and a Case Study
Eloisa Vargiu
Intelligent Agents and Soft-Computing Group
Dept. of Electrical and Electronic Engineering
University of Cagliari, Italy
February 16, 2011 – Valencia (Spain) email: vargiu@diee.unica.it
2. My Background
2000 – 2004: Automatic planning
Classic domains: HW[]
Dynamic domains: HIPE
2004 – 2009: Bioinformatics
Protein secondary structure prediction: MASSP3 and GAME/SSP
2000 – …: Multiagent systems
A Personalized Adaptive and Cooperative Multiagent System: PACMAS
A generic architecture to perform information retrieval tasks: X.MAS
2006 – …: Information Retrieval
Hierarchical text categorization: PF and TSA
Recommender systems and contextual advertising: ConCA
3. Outline
Context and Mission
Why Bioinformatics Needs Information Retrieval
Bioinformatics Meets Information Retrieval
Case Study: Retrieving and Filtering Bioinformatics Publications
Conclusions
5. Web Evolution
Web 1.0 1993
Source of information
Personal homepages
Web 2.0 2004
Social networks
(Micro)Blogging
Web 3.0 2005
Semantic web
Web composition
6. Web Evolution and Bioinformatics
A long time ago...
Data was stored in local DBs
Data was shared as flat files
Biologists worked alone or in small groups
7. Web Evolution and Bioinformatics
Today...
Online repositories
The major sources of nucleotide sequence are the ones belonging to the
International Nucleotide Sequence Database Collaboration
DDBJ (DNA DataBank of Japan)
EMBL (European Molecular Biology Laboratory)
GenBank (NIH genetic sequence database)
Web services
Basic bioinformatics services are
classified by the EBI into three categories
SSS (Sequence Search Services)
MSA (Multiple Sequence Alignment)
BSA (Biological Sequence Analysis)
8. Web Evolution and Scientific Publications
A long time ago...
Publications were consulted at the library
Just two or three relevant available journals
Manual selection of relevant publications
9. Web Evolution and Scientific Publications
Today...
Online journals
Online conference proceedings
Publications are often available for free
Manual selection of relevant publications becomes unfeasible
10. As a Consequence...
Unstructured information
Information overload
Personalized information selection and input imbalance
11. Our Mission
To cope with
Unstructured information, classifying documents according to a
given taxonomy
Information overload, filtering information to reduce redundancy
Personalized information selection and input imbalance, filtering
information according to user preferences
Case study
Retrieving and filtering bioinformatics publications
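The two filtering goals above can be illustrated with a minimal Python sketch. It assumes a keyword-based, top-down classifier over a small taxonomy; the taxonomy, its keywords, and the function names `classify` and `filter_for_user` are hypothetical and are not taken from the system presented in the talk.

```python
# Hypothetical two-level taxonomy: each node has keywords and optional children.
TAXONOMY = {
    "bioinformatics": {
        "keywords": {"protein", "gene", "genome", "sequence"},
        "children": {
            "structure prediction": {"keywords": {"structure", "folding"}, "children": {}},
            "gene expression": {"keywords": {"expression", "microarray"}, "children": {}},
        },
    },
    "information retrieval": {
        "keywords": {"retrieval", "indexing", "query"},
        "children": {},
    },
}


def classify(text, nodes=TAXONOMY, path=()):
    """Top-down classification: descend only into nodes whose keywords match."""
    words = set(text.lower().split())
    paths = []
    for name, node in nodes.items():
        if words & node["keywords"]:
            here = path + (name,)
            deeper = classify(text, node["children"], here)
            # Keep the most specific matching paths; fall back to this node.
            paths.extend(deeper if deeper else [here])
    return paths


def filter_for_user(docs, preferred):
    """Keep documents classified under at least one preferred category."""
    kept = []
    for doc in docs:
        categories = {name for p in classify(doc) for name in p}
        if categories & preferred:
            kept.append(doc)
    return kept
```

For example, `filter_for_user(docs, {"gene expression"})` keeps only the documents whose classification path contains that category. A real system would replace the keyword test with a trained classifier at each taxonomy node.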
12. Research Topics
Information Retrieval
Bioinformatics
13. Information Retrieval
Information Retrieval (IR) deals with the representation,
storage, organization of, and access to information items.
The user must first translate this information need into a query
which can be processed by an IR system.
Given the user query, the key goal of an IR system is to retrieve
information which might be useful or relevant to the user.
R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval.
New York: Addison-Wesley, 1999.
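As an illustration of the definition above, the following Python sketch builds an inverted index (the representation and storage step) and answers a boolean AND query (the access step). The toy collection and the names `tokenize`, `build_index`, and `search` are assumptions made for this example only.

```python
from collections import defaultdict


def tokenize(text):
    """Lowercase a text and split it on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents containing every query term (boolean AND)."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical toy collection, for illustration only.
DOCS = {
    1: "protein secondary structure prediction",
    2: "gene expression analysis",
    3: "protein expression analysis",
}
INDEX = build_index(DOCS)
```

Here `search(INDEX, "protein expression")` returns only document 3, the one document that contains both terms. Ranking by relevance, which the definition also implies, is left out of this sketch.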
14. Main IR Topics
Indexing
Search and Web Search
Information Filtering
Text Mining
Text Categorization and Hierarchical Text Categorization
15. Bioinformatics
Bioinformatics is the field of science in which biology,
computer science, and information technology merge to form a
single discipline.
The ultimate goal of the field is to enable the discovery of new
biological insights as well as to create a global perspective from
which unifying principles in biology can be discerned.
National Center for Biotechnology Information (NCBI),
http://www.ncbi.nlm.nih.gov/.
16. Main Bioinformatics Research Areas
Sequence analysis
Genome annotation
Computational evolutionary biology
Analysis of gene expression
Analysis of protein expression
Analysis of mutations in cancer
Comparative genomics
Modelling biological systems
Prediction of protein structure
Molecular interaction
17. Why Bioinformatics Needs Information Retrieval
18. Does Bioinformatics Need IR?
Bioinformatics is concerned with researching, developing and
applying tools and methods to acquire, analyse, organize and
store biological and medical data
Indexing and search techniques may help in the task of acquiring
Information filtering, text mining and text categorization
techniques may be useful to the analysis of data
Text categorization, with particular reference to hierarchical text
categorization, may be used in the organization and storage tasks
19. Bioinformatics Data
A huge amount of data to be
Indexed
Searched for in large databases or on the web
Filtered according to users' preferences
Text mined
Categorized according to its textual content
20. DB Indexing
Why
Data types are relegated to blob and unstructured text fields
Few results in building persistent access paths to support fast
retrieval methods
Genomic datasets in public repositories are annotated with free-text
fields describing the pathological state of the studied sample
Annotations are not mapped to concepts in any ontology
21. DB Indexing
Who
MoBIoS – Molecular Biological Information System
What
A specialized database management system
The storage manager is based on metric-space indexing
Query language entails biological data types
Where
Sequence homology: local alignment and mutations
D. Miranker, W. Xu, and R. Mao. MoBIoS: a Metric-Space DBMS to
Support Biological Discovery. Proceedings of the International
Conference on Scientific and Statistical Database Management
Systems, 2003.
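Metric-space indexing can be illustrated with a pivot-table sketch in Python. This is a generic textbook technique under stated assumptions, not MoBIoS's actual storage manager: it assumes equal-length sequences compared by Hamming distance, and the class name `PivotIndex` and the sample data are hypothetical.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


class PivotIndex:
    """Pivot-table index: stores the distance from every item to a few pivots."""

    def __init__(self, items, pivots, dist=hamming):
        self.items = list(items)
        self.pivots = list(pivots)
        self.dist = dist
        # Precompute d(item, pivot) once, at build time.
        self.table = [[dist(it, p) for p in self.pivots] for it in self.items]
        self.computed = 0  # item distances evaluated by the last query

    def range_query(self, query, radius):
        """Return all items within `radius` of `query`."""
        q_to_p = [self.dist(query, p) for p in self.pivots]
        self.computed = 0
        hits = []
        for item, row in zip(self.items, self.table):
            # Triangle inequality: |d(q, p) - d(x, p)| <= d(q, x).
            # If this lower bound exceeds the radius, the item cannot match.
            if any(abs(qp - xp) > radius for qp, xp in zip(q_to_p, row)):
                continue
            self.computed += 1
            if self.dist(query, item) <= radius:
                hits.append(item)
        return hits


# Hypothetical toy data, for illustration only.
ITEMS = ["ACGTACGT", "ACGTACGA", "TTTTTTTT", "ACGTTCGT", "GGGGCCCC"]
INDEX = PivotIndex(ITEMS, pivots=["TTTTTTTT"])
```

The index only requires that the distance function satisfies the triangle inequality, which is why the same structure can serve any biological data type for which a metric is defined.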
22. DB Indexing
Who
--
What
Ontology-driven indexing of public datasets for translational
bioinformatics
Methods to map text annotations of gene expression datasets to
concepts in the UMLS
Where
Gene Expression Omnibus
Stanford Tissue Microarray Database
N.H. Shah, C. Jonquet, A.P. Chiang, A.J. Butte, R. Chen, and M.A.
Musen. Ontology-driven indexing of public datasets for translational
bioinformatics. BMC Bioinformatics, 10(Suppl 2):S1, 2009.
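The idea of mapping free-text annotations to ontology concepts can be sketched as a dictionary lookup in Python. This is only an assumption-laden illustration, not the method of the cited paper: the term list, the `CONCEPT:` identifiers (placeholders, not real UMLS identifiers), the dataset ids, and the function names are all hypothetical.

```python
import re

# Hypothetical term-to-concept dictionary. The identifiers are placeholders
# invented for this example; they are NOT real UMLS concept identifiers.
CONCEPTS = {
    "breast cancer": "CONCEPT:0001",
    "lung cancer": "CONCEPT:0002",
    "cancer": "CONCEPT:0003",
    "diabetes": "CONCEPT:0004",
}


def annotate(text, concepts=CONCEPTS):
    """Return the concept ids whose terms occur in `text`, longest terms first."""
    normalized = re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()
    found = []
    # Try longer terms first so that "breast cancer" wins over "cancer".
    for term in sorted(concepts, key=len, reverse=True):
        pattern = r"\b" + re.escape(term) + r"\b"
        if re.search(pattern, normalized):
            found.append(concepts[term])
            # Blank out the matched text so shorter terms do not match inside it.
            normalized = re.sub(pattern, " ", normalized)
    return found


def build_concept_index(datasets, concepts=CONCEPTS):
    """Map each concept id to the sorted list of dataset ids annotated with it."""
    index = {}
    for ds_id, description in datasets.items():
        for cid in annotate(description, concepts):
            index.setdefault(cid, []).append(ds_id)
    return {cid: sorted(ids) for cid, ids in index.items()}


# Hypothetical dataset descriptions, for illustration only.
DATASETS = {
    "DS1": "Breast-cancer biopsy",
    "DS2": "Type 2 diabetes, pancreatic islets",
    "DS3": "Lung cancer cell line; breast cancer control",
}
CONCEPT_INDEX = build_concept_index(DATASETS)
```

Once datasets are indexed by concept rather than by raw text, a query for one concept retrieves every dataset annotated with any of its surface forms.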
23. Web Indexing
Why
Most often sequence retrieval tools and sequence analysis tools are
separated
The usage of sequence DBs is often general and limited to
keyword searching and entry retrieval
Discovering and accessing the appropriate bioinformatics resource
for a specific task has become increasingly important
24. Web Indexing
Who
SIRW – A Web Server for Simple Indexing and Retrieval System
What
A WWW interface to the Simple Indexing and Retrieval (SIR)
system to parse and index flat file DBs
A framework for doing sequence analysis for selected biological
sequences
Where
Sequence analysis: motif pattern searches
C. Ramu. SIRW: a web server for the Simple Indexing and Retrieval
System that combines sequence motif searches with keyword searches.
Nucleic Acids Research, 31(13). pp. 3771-3774, 2003.
25. Web Indexing
Who
BIRI - BIoinformatics Resource Inventory
What
An approach for automatically discovering and indexing public
bioinformatics resources
Where
The scientific literature
G. de la Calle, M. García-Remesal, S. Chiesa, D. de la Iglesia, V.
Maojo. BIRI: a new approach for automatically discovering and
indexing available public bioinformatics resources from the literature.
BMC Bioinformatics, Oct 7;10:320, 2009.
26. DB Search
Why
A wealth of bioinformatics tools and databases has been created
over the last decade and most are freely available
Often it is desired to visualize the database hits stacked according
to the query sequence
There is no inventory presenting an up-to-date and easily
searchable index of all these resources
27. DB Search
Who
MView – Multiple alignment Viewer
What
A tool for converting the result of a sequence database search into
the form of a coloured multiple alignment of hits stacked against
the query
Where
Multiple alignment
N.P. Brown, C. Leroy, and C. Sander. MView: a web-compatible
database search or multiple alignment viewer. Bioinformatics, 14(4), pp.
380-381, 1998.
28. DB Search
Who
BioWareDB
What
An extensive and current catalog of software and DBs of relevance
to researchers in the field of biology and medicine
Where
Current and available biomedical computing resources
M.W. Matthiessen. BioWareDB: the biomedical software and database
search engine. Bioinformatics, 19(17), pp. 2319-2320, 2003.
29. Web Search
Why
Today, scientists can easily post their research findings on the Web
or compare their discoveries with previous work
Manually maintaining a wrapper library will not scale to
accommodate the growth of genomics data sources on the Web
30. Web Search
Who
---
What
An automated system able to find, classify, and wrap new sources
without constant human intervention
Where
Distributed genomics data sources
D. Rocco and T. Critchlow. Automatic discovery and classification of
bioinformatics Web sources. Bioinformatics, 19(15), pp. 1927-1933,
2003.
31. Web Search
Who
GoPubMed
What
An ontology-based literature search applied to Gene Ontology
(GO) and PubMed
Where
Scientific literature
R. Delfs, A. Doms, A. Kozlenkov, and M. Schroeder. GoPubMed:
ontology-based literature search applied to gene ontology and PubMed.
In Proceedings of German Bioinformatics Conference, pp. 169–178,
2004.
32. Information Filtering
Why
In the Web 2.0 scenario, users look for collaborative environments,
in which they can meet further users with similar preferences and
needs
Researchers need to search for and/or generate specialized datasets
that meet specific requirements
33. Information Filtering
Who
ProDaMa-C Protein Dataset Management – Collaborative
What
A web application aimed at
Generating specialized protein structure datasets
Favouring the collaboration among researchers
Where
Protein structures
G. Armano and A. Manconi. A Collaborative Web Application for
Supporting Researchers in the Task of Generating Protein Datasets.
Advances in Distributed Agent-based Retrieval Tools, V. Pallotta, A.
Soro, E. Vargiu (eds.), Springer-Verlag, 2011.
34. Information Filtering
Who
Gene Recommender
What
An algorithm that ranks genes according to how strongly they
correlate with a set of query genes
Where
Analysis of gene expression
A.B. Owen, J. Stuart, K. Mach, A.M. Villeneuve, S. Kim. A gene
recommender algorithm to identify coexpressed genes. Genome
Research, Aug;13(8), pp. 1828-37, 2003.
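The ranking idea can be illustrated with a minimal sketch. It is a simplification and not the published Gene Recommender algorithm: it only ranks candidate genes by their mean Pearson correlation with the query genes. The function names and the toy expression profiles are hypothetical.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equally long expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

def rank_genes(expression, query):
    """Rank non-query genes by mean correlation with the query genes."""
    scores = {}
    for gene, profile in expression.items():
        if gene in query:
            continue
        correlations = [pearson(profile, expression[q]) for q in query]
        scores[gene] = sum(correlations) / len(correlations)
    return sorted(scores, key=scores.get, reverse=True)

# Toy data (hypothetical): one expression profile per gene,
# one value per experiment.
expression = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.0],
    "geneC": [1.1, 2.1, 2.9, 4.2],
    "geneD": [4.0, 3.0, 2.0, 1.0],
}
ranking = rank_genes(expression, query={"geneA", "geneB"})
```

Here geneC follows the query genes closely and geneD is anti-correlated with them, so geneC is ranked first.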
35. Text Mining
Why
Web-based tools capable of filtering public DBs are increasingly
required
Interesting and useful information, relevant to the researcher, could
appear in documents (e.g., papers) they have not read and therefore
be missed entirely
Of paramount importance to DB search methods is a reliable
means of distinguishing true hits from false hits
Biologists construct a pathway by reading a large number of
articles and interpreting them as a consistent network, but the link to
the original article is lost
36. Text Mining
Who
MedMiner
What
An Internet text mining tool that filters the literature and presents
the most relevant portions in a well-organized way that facilitates
understanding
Where
Gene expression profiling
L. Tanabe, U. Scherf, L.H. Smith, J.K. Lee, L. Hunter, and J.N.
Weinstein. MedMiner: an Internet Text-Mining Tool for Biomedical
Information, with Application to Gene Expression Profiling.
Biotechniques, Dec;27(6), pp. 1210-4, 1999.
37. Text Mining
Who
BioRAT
What
A research assistant that, given a query,
autonomously finds a set of papers
reads them
highlights the most relevant facts in each
Where
Scientific literature
D. P. A. Corney, B. F. Buxton, W. B. Langdon, and D. T. Jones.
BioRAT: Extracting biological information from full-length papers.
Bioinformatics, 20(17), pp. 3206–3213, 2004.
38. Text Mining
Who
SAWTED – Structure Assignment With Text Description
What
An automated system for filtering DB hits
Where
Homologues annotation
R.M. MacCallum, L.A. Kelley, and M.J. Sternberg. SAWTED: structure
assignment with text description-enhanced detection of remote
homologues with automated SWISS-PROT annotation comparisons.
Bioinformatics, Feb;16(2), pp. 125-9, 2000.
39. Text Mining
Who
PathText
What
A system to integrate a pathway visualizer, text mining systems
and annotation tools into a seamless environment
Where
Pathway visualizations
B. Kemper, T. Matsuzaki, Y. Matsuoka, Y. Tsuruoka, H. Kitano, S.
Ananiadou, and J. Tsujii. PathText: a text mining integrator for
biological pathway visualizations. Bioinformatics, 26(12), pp. i374-i381,
2010.
40. Text Categorization
Why
Information in text form, such as MEDLINE records, is a greatly
underutilized source of biological information
Individual researchers find it difficult to keep up with all the new,
relevant information
Systems that extract structured information from natural language
passages have been highly successful in specialized domains
The time is ripe for developing such applications for molecular biology
and genomics
41. Text Categorization
Who
--
What
Constructing biological knowledge bases by extracting information
from text sources
Where
MEDLINE
M. Craven and J. Kumlien. Constructing Biological Knowledge Bases
by Extracting Information from Text Sources. In Proceedings of the 7th
International Conference on Intelligent Systems for Molecular Biology,
1999.
42. Text Categorization
Who
Genies
What
A natural-language processing system for the extraction of
molecular pathways
Where
Scientific publications
C. Friedman, P. Kra, H. Yu, M. Krauthammer, and A. Rzhetsky. Genies:
a natural-language processing system for the extraction of molecular
pathways from journal articles. Bioinformatics, 17, pp. 574–582, 2001.
43. Hierarchical Text Categorization
Why
A great deal of genomics information accumulated over the years is
available in online text repositories (such as MEDLINE)
These resources still do not provide adequate mechanisms for
retrieving the required information
Traditional filtering techniques based on keyword search are often
inadequate to express what the user is really searching for
Web repositories, such as Medical Subject Headings (MeSH) in
MEDLINE, encompass an underlying taxonomy
44. Hierarchical Text Categorization
Who
--
What
A tool for assisting biologists with literature search for the task of
associating genes with Gene Ontology codes
Where
MEDLINE
S. Kiritchenko, S. Matwin, and A. F. Famili. Hierarchical text
categorization as a tool of associating genes with gene ontology codes.
In 2nd European Workshop on Data Mining and Text Mining for
Bioinformatics, pp. 26–30, 2004.
45. Hierarchical Text Categorization
Who
Pub.MAS
What
A multiagent system for retrieving and classifying publications
Where
BMC Bioinformatics
PubMed Central
G. Armano, A. Manconi, and E. Vargiu. A MultiAgent System for
Retrieving Bioinformatics Publications from Web Sources. IEEE
Transactions on Nanobioscience, Special Session on GRID, Web
Services, Software Agents and Ontology Applications for Life Science,
6(2), pp. 104-109, 2007.
46. Case Study:
Retrieving and Filtering
Bioinformatic Publications
47. An IR Task
Information Extraction
Online Repositories
Wrapping Information Sources
Extracted Data/Information
Text Categorization
Selected Data/Information Taxonomic Classification of Items
User's Feedback
Adaptive Behavior
48. Information Extraction
Essential to retrieve documents provided by heterogeneous and
distributed sources
A.H.F. Laender, B.A. Ribeiro-Neto, A.S. da Silva, J.S. Teixeira (2002) :
A brief survey of web data extraction tools. SIGMOD Rec. 31(2), pp.
84–93.
49. Text Categorization
It is the task of determining and assigning topical labels to
content
Typical approaches to text categorization
Statistical
Semantic
In recent years several researchers have investigated the use of
hierarchies for text categorization
F. Sebastiani. A tutorial on automated text categorisation. Proceedings
of ASAI-99, 1st Argentinian Symposium on Artificial Intelligence,
pp. 7-35, 1999.
50. Users' Feedback
It is aimed at dealing with any feedback provided by the user
In semiautomated classification and adaptive filtering we may
expect the user of a classifier to provide feedback on how test
documents have been classified
In this case further training may be performed during the
operating phase
51. Hierarchical Text Categorization
Hierarchical Text Categorization (HTC) deals with problems
where categories are organized in the form of a hierarchy.
D. Koller, M. Sahami. Hierarchically classifying documents using very
few words. Proceedings of 14th International Conference on Machine
Learning, pp. 170– 178, 1997.
52. HTC at a Glance
HTC studies how to improve the performance of
classical text categorization techniques by exploiting
knowledge of the taxonomic relationships among classes
53. Motivations
People organize large collections of documents in hierarchies of
topics, or arrange a large body of knowledge in ontologies
The main goal of automatic text categorization is to deal with
underlying taxonomies
A hierarchical approach can
give benefits in real-world
scenarios, characterized by
information overload and
imbalanced data
54. HTC Approaches
Pachinko machine
At each level of the hierarchy
The classifier selects the one most probable category
It goes down the hierarchy inspecting only the children of the selected
nodes
Probabilistic hierarchical local approach
At each level of the hierarchy
The classifier makes probabilistic decisions
It selects the leaf categories on the most probable paths
S. Kiritchenko. Hierarchical text categorization and its application to
bioinformatics. Ph.D. Thesis, University of Ottawa, Canada, 2006.
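The Pachinko-machine descent can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation from the cited thesis: the taxonomy, the dict-based node representation, and the keyword-counting scorer (a stand-in for trained per-category classifiers) are all hypothetical.

```python
def node(label, *children):
    """Build a taxonomy node as a plain dict (hypothetical representation)."""
    return {"label": label, "children": list(children)}

# Hypothetical keyword lists standing in for trained classifiers.
KEYWORDS = {
    "sequence": ["sequence", "genome"],
    "structure": ["structure", "fold"],
    "alignment": ["alignment"],
    "assembly": ["assembly", "reads"],
    "folding": ["folding"],
}

def keyword_score(label, document):
    """Toy scorer: count how often a category's keywords occur in the text."""
    text = document.lower()
    return sum(text.count(word) for word in KEYWORDS[label])

def pachinko_classify(root, document, score=keyword_score):
    """Greedy top-down descent: at each level keep only the best child."""
    path = []
    current = root
    while current["children"]:
        current = max(
            current["children"],
            key=lambda child: score(child["label"], document),
        )
        path.append(current["label"])
    return path

taxonomy = node(
    "root",
    node("sequence", node("alignment"), node("assembly")),
    node("structure", node("folding")),
)
path = pachinko_classify(
    taxonomy, "A fast alignment method for genome sequence comparison"
)
```

Because only the children of the selected node are inspected, an error made at an upper level cannot be recovered lower down; the probabilistic local approach instead compares whole paths before choosing a leaf.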
55. HTC Approaches
Local classifier per node
Each classifier decides whether to forward the document to its children
Local classifier per parent node
Each classifier decides to which subtree(s) the document should be
sent
Local classifier per level
The number of outputs per level grows while going down through
the taxonomy
Global classifier
One classifier is trained, able to discriminate among all categories
C.J. Silla and A. Freitas. A survey on hierarchical classification across
different application domains. Journal of Data Mining and Knowledge
Discovery, 2(1-2), pp. 31-72, 2010.
56. Progressive Filtering
Progressive Filtering (PF) is a simple categorization technique
that operates on hierarchically structured categories
A way to implement PF consists of decomposing a given rooted
taxonomy into pipelines, one for each path that exists between
the root and each node of the taxonomy
Each node is a binary classifier able to recognize whether or not
an input belongs to the corresponding class
A threshold selection algorithm (TSA) can be run to identify an
optimal, or sub-optimal, combination of thresholds for each
pipeline
A. Addis, G. Armano, E. Vargiu. Assessing Progressive Filtering to
Perform Hierarchical Text Categorization in Presence of Input
Imbalance. Proceedings of International Conference on Knowledge
Discovery and Information Retrieval, pp. 14-23, 2010.
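The decomposition into pipelines can be sketched as follows. This is a minimal illustration, not the authors' implementation: the taxonomy, the classifier scores, and the thresholds are hypothetical, and the root is assumed to be a plain entry point rather than a classifier.

```python
def pipelines(taxonomy, root):
    """Return one pipeline (root-to-node path) for every node below the root."""
    result = []

    def visit(current, prefix):
        for child in taxonomy.get(current, []):
            path = prefix + [child]
            result.append(path)
            visit(child, path)

    visit(root, [])
    return result

def accepts(pipeline, scores, thresholds):
    """An input passes a pipeline only if every classifier along it accepts."""
    return all(scores[name] >= thresholds[name] for name in pipeline)

# Hypothetical taxonomy given as a parent -> children mapping.
taxonomy = {"root": ["A", "B"], "A": ["A1", "A2"]}
all_pipelines = pipelines(taxonomy, "root")

# Hypothetical classifier outputs for one input document.
scores = {"A": 0.8, "A1": 0.3, "A2": 0.7, "B": 0.2}
thresholds = {name: 0.5 for name in scores}
accepted = [p for p in all_pipelines if accepts(p, scores, thresholds)]
```

In this toy run the input is accepted by the pipelines ending in A and A2, and filtered out of those ending in A1 and B.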
57. PF at a Glance
Starting from the root, each input traverses the taxonomy as a
“token”
58. Classifiers in PF
Partitioning the taxonomy in pipelines gives rise to a set of new
classifiers, each represented by a pipeline
60. Classifiers in PF
The same classifier may have different behaviours, depending on
the pipeline in which it is embedded
Each pipeline can be considered in isolation from the others
61. Threshold Selection in PF
A relevant problem is how to calibrate the thresholds of the
binary classifiers embedded in each pipeline in order to
optimize the pipeline behaviour
Searching for an optimal or sub-optimal combination of
thresholds in a pipeline can be viewed as the problem of
finding a maximum of a utility function F that depends on the
corresponding threshold vector θ
62. TSA
For each pipeline the best combination of thresholds is
calculated according to a bottom-up algorithm that uses two
functions
Repair, which increases/decreases (↑ / ↓) the threshold until the
utility function reaches a maximum
Calibrate, which recursively operates downward from the given
classifier by repeatedly calling repair (↑ / ↓)
A. Addis, G. Armano, E. Vargiu. A comparative experimental
assessment of a threshold selection algorithm in hierarchical text
categorization. In: Advances in Information Retrieval. The 33rd
European Conference on Information Retrieval (ECIR 2011), 2011
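The interplay of the two functions can be sketched as follows. This is a simplified reading of the description above and not the published TSA: the step size, the starting thresholds, the order in which the two directions are tried, and the toy utility function are all assumptions made for illustration.

```python
def repair(thresholds, k, utility, step, direction):
    """Move threshold k up (+1) or down (-1) while the utility improves."""
    best = utility(thresholds)
    while 0.0 <= thresholds[k] + direction * step <= 1.0:
        thresholds[k] += direction * step
        value = utility(thresholds)
        if value <= best:
            thresholds[k] -= direction * step  # undo the non-improving move
            break
        best = value
    return best

def calibrate(thresholds, k, utility, step):
    """Repair classifier k in both directions, then recurse downward."""
    for direction in (+1, -1):
        repair(thresholds, k, utility, step, direction)
    if k + 1 < len(thresholds):
        calibrate(thresholds, k + 1, utility, step)

def select_thresholds(n, utility, step=0.1):
    """Bottom-up pass: start at the deepest classifier, end at the top one."""
    thresholds = [0.0] * n
    for k in reversed(range(n)):
        calibrate(thresholds, k, utility, step)
    return thresholds

# Toy utility (hypothetical): highest when thresholds match the targets.
targets = [0.3, 0.6]

def toy_utility(thresholds):
    return -sum((t - g) ** 2 for t, g in zip(thresholds, targets))

best_thresholds = select_thresholds(len(targets), toy_utility)
```

In a real setting the utility F would be a performance measure computed on a validation set, so each call to it is expensive and the number of evaluations matters.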
64. The Prototype
MultiAgent Architecture
X.MAS
Agent Framework
JADE
A. Addis, G. Armano, E. Vargiu. From a Generic Multiagent
Architecture to Multiagent Information Retrieval Systems. In: AT2AI-6,
Sixth International Workshop, From Agent Theory to Agent
Implementation, pp. 3–9, 2008.
F. Bellifemine, G. Caire, D. Greenwood. Developing Multi-Agent
Systems with JADE (Wiley Series in Agent Technology). John Wiley
and Sons, 2007.
65. X.MAS at a Glance
Macro-architecture
66. X.MAS at a Glance
Micro-architecture
Information Agent: Scheduler, Source
Middle Agent: Scheduler, Dispatcher
Filter Agent: Scheduler, Actuator
Middle Agent: Scheduler, Dispatcher
Task Agent: Scheduler, Actuator
Middle Agent: Scheduler, Dispatcher
Interface Agent: Scheduler
68. Pub.MAS
G. Armano, A. Manconi, and E. Vargiu. A MultiAgent System for
Retrieving Bioinformatics Publications from Web Sources. IEEE
Transactions on Nanobioscience, Special Session on GRID, Web
Services, Software Agents and Ontology Applications for Life Science,
6(2), pp. 104-109, 2007.
69. Information Extraction
It is supported by a set of agents explicitly devoted to
wrap the selected information sources
encode the extracted documents
An information agent wraps the BMC Bioinformatics web site
HTML wrapper
An information agent wraps the PubMed Central digital archive
Web service wrapper
70. Hierarchical Text Categorization
The PF approach previously described has been implemented
Each document has been encoded to
remove all non-informative words
remove the most common morphological and inflexional suffixes
select the relevant features
generate a feature vector for each document
Classification is performed by wkNN classifiers
the score is assigned using non-parametric density estimation of the
“a posteriori” probability
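The encoding steps and the wkNN scoring can be sketched together. This is a minimal illustration under stated assumptions, not the Pub.MAS implementation: the stopword list and the suffix stripper are crude stand-ins for a real stopword list and stemmer, feature selection is omitted, and the score is assumed to be the similarity-weighted share of positive examples among the k nearest neighbours.

```python
from collections import Counter
from math import sqrt

STOPWORDS = {"a", "an", "the", "of", "for", "and", "in", "to", "is"}
SUFFIXES = ("ing", "ed", "s")  # crude stand-in for a real stemmer

def encode(text):
    """Drop stopwords, strip common suffixes, count the remaining terms."""
    terms = []
    for word in text.lower().split():
        if word in STOPWORDS:
            continue
        for suffix in SUFFIXES:
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                word = word[: -len(suffix)]
                break
        terms.append(word)
    return Counter(terms)

def cosine(u, v):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = sqrt(sum(x * x for x in u.values()))
    norm *= sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

def wknn_score(document, training, k=3):
    """Similarity-weighted share of positive examples among k neighbours."""
    vector = encode(document)
    neighbours = sorted(
        ((cosine(vector, encode(text)), label) for text, label in training),
        reverse=True,
    )[:k]
    total = sum(sim for sim, _ in neighbours)
    positive = sum(sim for sim, label in neighbours if label)
    return positive / total if total else 0.0

# Hypothetical training set for one binary node classifier.
training = [
    ("aligning protein sequences", True),
    ("sequence alignment algorithms", True),
    ("clustering gene expression profiles", False),
    ("protein structure prediction", False),
]
score = wknn_score("fast alignment of sequences", training)
```

The score lies between 0 and 1, so it can be compared directly with the threshold of the corresponding classifier in a pipeline.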
71. The Adopted Taxonomy
P. G. Baker, C. A. Goble, S. Bechhofer, N. W. Paton, R. Stevens, and A.
Brass. An ontology for bioinformatics applications. Bioinformatics,
15(6), pp. 510–520, 1999.
74. Users' Feedback
User feedback is aimed at dealing with any feedback provided
by the user
Two solutions have been tried
training an ANN
using a kNN classifier
75. Experiments
Different kinds of tests have been performed, each aimed at
highlighting a specific issue
we estimated the (normalized) confusion matrix for each classifier
belonging to the highest level of the taxonomy
we studied the impact of taking into account pipelines of
classifiers, also trying to assess whether a residual independence
was in fact present
we assessed the solution devised for implementing user’s feedback,
based on the k-NN technique
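A normalized confusion matrix for a single binary classifier can be computed as in the sketch below. Row-wise normalization (each row sums to 1) is an assumption, since the slides do not say how the matrix was normalized; the labels and predictions are hypothetical.

```python
def normalized_confusion(actual, predicted, labels):
    """Row-normalized confusion matrix: rows are actual, columns predicted."""
    index = {label: i for i, label in enumerate(labels)}
    counts = [[0] * len(labels) for _ in labels]
    for a, p in zip(actual, predicted):
        counts[index[a]][index[p]] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix

# Hypothetical test outcome for one top-level classifier.
actual = [True, True, True, False]
predicted = [True, True, False, False]
matrix = normalized_confusion(actual, predicted, labels=[True, False])
```

With this normalization the first row gives the fraction of relevant documents that were accepted and rejected, and the second row does the same for non-relevant documents.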
76. Experiments
Tests have been performed using selected publications extracted
from the BMC Bioinformatics site and from the PubMed Central
digital archive
Publications have been classified by an expert of the domain
according to the proposed taxonomy
For each item of the taxonomy, a set of about 100-150 articles
has been selected to train the corresponding wk-NN classifier,
and 300-400 articles have been used to test it
78. Conclusions
Bioinformatics needs suitable, automated, and “intelligent”
solutions to acquire, analyse, organize, and store biological data
IR might be very useful in tackling bioinformatics problems
Currently, only a few IR techniques have been adopted for
bioinformatics tasks
A system aimed at retrieving and filtering bioinformatics
publications has been presented as case study
We argue that further investigations and experiments could be
made to exploit IR in bioinformatics
79. Acknowledgments
This work was partially supported by the Italian Ministry of
Education – Investment funds for basic research, under the
project ITALBIONET – Italian Network of Bioinformatics
I wish to thank all the IASC Group members for their valuable
help
IASC Group members are:
G. Armano – head
A. Addis, F. Mascia and E. Vargiu – PhD, Post Doc
A. Giuliani, N. Hatami, M. Javarone and F. Ledda – PhD students
S. Curatti – collaborator, programmer
I also wish to thank Andrea Manconi for his suggestions
80. Thanks for your
attention!
Contact: Eloisa Vargiu vargiu@diee.unica.it