Knowledge Driven User Interfaces for Complex Biological Queries
KIERAN O’NEILL, ALEXANDER GARCIA-CASTRO† and DANIEL JACOBSON
National Bioinformatics Network, Central Node
and †The International Center for Tropical Agriculture (CIAT)
With the explosion of biological data in the postgenomic era, there has been a growing need for semantic data integration, supported
by ontologies. Semantic integration techniques enable biologists to construct complex biological queries. However, the construction
of these queries and analysis of their results can place a high cognitive load on biologists. This paper presents a proposed information
visualisation tool, Digr, to aid biologists in these processes within the context of DigraBase, a graph database for semantic data
integration. A working example of a query is presented, to illustrate the complexity of the information spaces under consideration.
Visualisation techniques that have been applied to similar problems are discussed in the context of their applicability to the problem
of aiding the construction of complex queries over DigraBase, and the interpretation of their results.
General Terms: Data Visualisation, Comparative Genomics
Additional Key Words and Phrases: Information Visualisation, Complex Biological Queries, Information Integration
Introduction
In the biological domain, there has been a shift from hypothesis-driven research, wherein data is collected purely
to answer a scientific question, to data-driven research, wherein large data sets are collected and made publicly
available for analysis and interpretation [Searls 2005]. This has resulted in an explosion in the amount of
molecular biological data that is publicly available. This data is stored in at least 858 databases [Galperin 2006],
using differing formats, schemata and query software [Wong 2002]. To enable biologists to fully leverage this
data, and the information it contains, the integration of data from disparate sources is essential.
The syntactic, or ‘low level’ [Searls 2005], integration of data is a problem that has been addressed by systems
such as the Sequence Retrieval System (SRS) [Etzold and Argos 1992], Entrez [Schuler et al. 1996] and others which
overcome heterogeneity in the structure of data [Garcia-Castro et al. 2005]. However, it has become clear that
there is a further need for the integration of the meaning contained within biological data, in other words semantic
[Garcia-Castro et al. 2005] or ‘higher level’ [Searls 2005] data integration.
As an example of the importance of overcoming semantic differences, the definition of the word ‘gene’ can be
considered: In three different databases, the term carries three different meanings, each dependent on the context
[Garcia-Castro et al. 2005]. Resolution of such semantic disagreements is an important aspect of semantic data
integration [Garcia-Castro et al. 2005].
Bio-ontologies provide a means to facilitate semantic integration. An ontology can be regarded as ‘a type of
knowledge base in which concepts and relations are stored’ [Garcia-Castro et al. 2005]. Gene Ontology (GO)
[Ashburner et al. 2000], has emerged as a de facto standard molecular biological ontology [Garcia-Castro et al.
2005]. GO captures the function of gene products in terms of their involvement in biological processes, the cellular
component they function in, and their molecular function. GO has been used to annotate genes across multiple
organisms, and thus can aid in cross-organism semantic integration. In addition to GO, specific genome projects,
such as the mouse (Mus musculus) [Blake et al. 2006], fly (family Drosophilidae) [Drysdale et al. 2005] and worm
(Caenorhabditis elegans) [Schwarz et al. 2006], have created, or are creating, their own domain ontologies for
capturing knowledge within their organism, such as anatomy ontologies, and phenotype ontologies (capturing
the effects of gene knockout) [Smith et al. 2005].
TAMBIS [Goble et al. 2001] illustrates another use for ontologies: that of facilitating biological query construction.
TAMBIS provides users with a conceptual view of the information sources their query will be performed
over, while shielding them from the underlying schema of those sources. This allows TAMBIS to present a uniform
interface to multiple data sources. The TAMBIS ontology also differs from GO and the domain ontologies
in that it represents higher level relations between concepts, beyond the subsumption relations to which the other
ontologies limit themselves.
Once queries have been constructed and results returned, biologists still need to make sense of the results.
Author Addresses: K O’Neill, NBN Central, Cape Town, South Africa, kieran@nbn.ac.za
C Garcia Castro, Centro Internacional de Agricultura Tropical, Cali, Colombia
D Jacobson, NBN Central, as above
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided
that the copies are not made or distributed for profit or commercial advantage, that the copies bear this notice and the full citation
on the first page. © 2007
Proceedings of SABioinf 2007, Pages 111–115
Often, the research process involves many iterations of querying, analysis and optimisation before the desired
result is achieved [Garcia-Castro et al. 2005]. However, the amount of information returned from such queries
is usually more than a human being can process mentally at once. It is necessary to provide cognitive support,
wherein some of the cognitive load is taken on by the tool, thus making it easier for the user to see previously
hidden associations as well as feasible operations [Walenstein 2002]. Information visualisation (IV) techniques
can provide this cognitive support, enabling users to more easily make sense of the results of queries, and to
construct new, refined ones [Tao et al. 2004].
Information visualisation techniques have been defined as ‘the use of computer-supported, interactive, visual
representations of abstract data to amplify cognition’ [Card et al. 1999]. They include such techniques as
providing overviews of data sets to users, enabling users to control how much and what information is displayed,
keeping a history to enable users to retrace their steps and explore, enabling users to view relationships between
the data displayed and other data, and allowing users to extract their data into a form they can use with other
software, or pass on to their peers [Shneiderman 1996]. In this paper, these tasks will be examined in the context
of building complex biological queries.
Architecture
DigraBase (DGB) is a graph database under development by the central node of the National Bioinformatics
Network (of South Africa) [Otgaar et al. 2006]. This system allows the execution of boolean declarative queries
over data loaded into it, which can be loaded from multiple sources. This gives biologists the ability to execute
complex biological queries, wherein subsets of biological objects, such as genes, are found, based on common
properties, such as function. Since these queries can be formulated across multiple data sets, this allows biologists
to find relationships between objects that were not obvious simply by looking at one database.
DGB uses a custom query language, DGBQL, and has a command line interface. However, the query language
is complex, and learning new languages, as well as using a command line interface, is difficult for most biologists
[Letondal 2001a; 2001b]. A simple, graphical interface to enable users to easily construct queries and interpret
the results, while shielding them from the complexities of the underlying software, is needed.
Digr is a proposed frontend to DGB, with the purpose of filling that need. As such, it is being constructed
according to the Model View Controller (MVC) architectural design pattern. The model component is responsible
for interaction with DGB via DGBQL and presenting results in a format usable by the view component. The
view component takes these results and presents them visually to the user. The controller component accepts
commands from the user, made via the user interface, and sends them to the model. Thus, the model can easily
be altered if the query language changes, and different versions implemented to handle different methods by
which DGBQL will be transmitted, such as CORBA and XML-RPC. The visual component of Digr can also be
altered or replaced, without affecting the rest of the system. Finally, the model/controller parts of the system
can be provided as a library for other software to use.
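The separation of concerns described above can be sketched as follows. This is a minimal illustration, not Digr's actual code: the class names (Transport, EchoTransport, QueryModel, TextView, Controller) are hypothetical, the transport is a stand-in that never contacts DigraBase, and the query string is a placeholder rather than real DGBQL.

```python
from typing import List, Protocol


class Transport(Protocol):
    """Hypothetical channel that carries a query string to the database."""

    def send(self, query_text: str) -> List[str]: ...


class EchoTransport:
    """Stand-in transport for illustration only; it does not contact DigraBase.

    A real implementation might wrap XML-RPC or CORBA, as the text suggests.
    """

    def send(self, query_text: str) -> List[str]:
        return [f"result for: {query_text}"]


class QueryModel:
    """Model: sends queries and returns results in a form the view can use."""

    def __init__(self, transport: Transport) -> None:
        self.transport = transport

    def run(self, query_text: str) -> List[str]:
        return self.transport.send(query_text)


class TextView:
    """View: presents results; replaceable without touching the model."""

    def __init__(self) -> None:
        self.lines: List[str] = []

    def show(self, results: List[str]) -> None:
        self.lines = [f"- {result}" for result in results]


class Controller:
    """Controller: accepts user commands and passes them to the model."""

    def __init__(self, model: QueryModel, view: TextView) -> None:
        self.model = model
        self.view = view

    def submit(self, query_text: str) -> None:
        self.view.show(self.model.run(query_text))


if __name__ == "__main__":
    view = TextView()
    controller = Controller(QueryModel(EchoTransport()), view)
    controller.submit("PLACEHOLDER QUERY")  # not real DGBQL syntax
    print("\n".join(view.lines))
```

Because the model depends only on the Transport interface, a different transmission method can be swapped in without changes to the view or controller, which is the property the design aims for.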
A Motivating Scenario
An example of a complex query is shown below, and illustrated in Figure 1. Phrases which can be represented
by ontology terms, as well as their contexts, are highlighted in bold. The desired data to be retrieved (human
genes) is italicised:
‘Retrieve all human genes that are normally expressed in the brain and are associated with poor memory
in mice and have a role in fatty acid metabolism.’
This query contains 3 ontology terms from different ontologies. ‘Poor memory in mice’ is represented by the
Mammalian Phenotype Ontology (MPO) term ‘abnormal learning/memory’ [Smith et al. 2005]. ‘Fatty acid
metabolism’ is represented by the GO term ‘fatty acid metabolism’, part of the GO biological process ontology
[Ashburner et al. 2000]. The query can be satisfied by finding genes directly annotated with one of these terms,
or one of their child terms (specialisations). (For instance, ‘linoleic acid metabolism’ is a type of fatty acid
metabolism, so genes annotated with ‘linoleic acid metabolism’ are also transitively annotated with ‘fatty acid
metabolism’.) ‘Normally expressed in the brain’ can be fulfilled by finding genes expressed in expressed sequence
tag (EST) libraries made from tissue extracted from the brain (represented by annotation of the library with
an anatomy ontology term ‘brain’). This constraint is more complex, requiring transition over an intermediary
layer (EST libraries) between the ontology and genes.
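Transitive annotation can be illustrated with a small sketch. The term hierarchy and gene names below are toy data invented for the example, and the functions ancestors and genes_for_term are hypothetical helpers, not part of DigraBase or GO tooling.

```python
from typing import Dict, Set

# Toy hierarchy (assumed for illustration): child term -> its parent terms.
PARENTS: Dict[str, Set[str]] = {
    "linoleic acid metabolism": {"fatty acid metabolism"},
    "fatty acid metabolism": {"lipid metabolism"},
    "lipid metabolism": set(),
}

# Toy direct annotations: gene -> terms it is directly annotated with.
DIRECT: Dict[str, Set[str]] = {
    "geneA": {"linoleic acid metabolism"},
    "geneB": {"lipid metabolism"},
}


def ancestors(term: str, parents: Dict[str, Set[str]]) -> Set[str]:
    """Return every term reachable by following parent links upwards."""
    seen: Set[str] = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, set()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def genes_for_term(
    term: str, direct: Dict[str, Set[str]], parents: Dict[str, Set[str]]
) -> Set[str]:
    """Genes annotated with the term itself or with any of its descendants."""
    hits: Set[str] = set()
    for gene, terms in direct.items():
        if any(t == term or term in ancestors(t, parents) for t in terms):
            hits.add(gene)
    return hits


if __name__ == "__main__":
    print(genes_for_term("fatty acid metabolism", DIRECT, PARENTS))
```

In this toy data, geneA matches 'fatty acid metabolism' through its more specific annotation, while geneB, annotated only with a more general term, does not.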
For associations between genes and these ontology terms, several data sources are available. The Institute for
Genomic Research (TIGR) Gene Indices database contains expression data for human genes [Lee et al. 2005].
Figure 1. An illustration of the example query within the context of its information space. The three ontologies from which the
terms are taken are shown as three-dimensional boxes. Within these, the terms chosen, as well as their immediate parents, and a
few of their immediate child terms, are shown. Large arrows show the connection between the chosen terms embodied by the query.
The EMBL/DDBJ/GenBank nucleotide database [Kanz et al. 2005] is cross-referenced with the GO Annotations
database [Camon et al. 2004]. The Mouse Genome database (MGDB) is annotated with both GO and MPO
[Blake et al. 2006]. Additionally, MGDB has orthology mappings between mouse and human genes, which can
be used to find genes by the terms their orthologs are annotated with.
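The way the three constraints of the example query combine across sources can be sketched as a set intersection. All identifiers and data below are invented toy values, and the helper functions are hypothetical; the sketch only mirrors the shape of the query (EST libraries as an intermediary layer, orthology as a bridge from mouse to human), not any real database content or DGBQL.

```python
from typing import Dict, Set

# Toy data (assumed): EST library -> tissue annotation, and library -> genes.
EST_LIBRARY_TISSUE: Dict[str, str] = {"lib1": "brain", "lib2": "liver"}
EST_LIBRARY_GENES: Dict[str, Set[str]] = {
    "lib1": {"HUMAN_G1", "HUMAN_G2"},
    "lib2": {"HUMAN_G3"},
}

# Toy data (assumed): mouse gene -> phenotype terms, and mouse -> human ortholog.
MOUSE_PHENOTYPE: Dict[str, Set[str]] = {
    "mouse_g1": {"abnormal learning/memory"},
    "mouse_g3": {"abnormal learning/memory"},
}
ORTHOLOG: Dict[str, str] = {
    "mouse_g1": "HUMAN_G1",
    "mouse_g2": "HUMAN_G2",
    "mouse_g3": "HUMAN_G3",
}

# Toy data (assumed): human gene -> GO terms.
HUMAN_GO: Dict[str, Set[str]] = {
    "HUMAN_G1": {"fatty acid metabolism"},
    "HUMAN_G2": {"fatty acid metabolism"},
    "HUMAN_G3": {"fatty acid metabolism"},
}


def expressed_in(tissue: str) -> Set[str]:
    """Genes found in any EST library annotated with the given tissue."""
    genes: Set[str] = set()
    for library, library_tissue in EST_LIBRARY_TISSUE.items():
        if library_tissue == tissue:
            genes |= EST_LIBRARY_GENES[library]
    return genes


def human_orthologs_with_phenotype(phenotype: str) -> Set[str]:
    """Human orthologs of mouse genes annotated with the phenotype term."""
    return {
        ORTHOLOG[mouse_gene]
        for mouse_gene, phenotypes in MOUSE_PHENOTYPE.items()
        if phenotype in phenotypes and mouse_gene in ORTHOLOG
    }


def annotated_with(go_term: str) -> Set[str]:
    """Human genes directly annotated with the GO term."""
    return {gene for gene, terms in HUMAN_GO.items() if go_term in terms}


def example_query() -> Set[str]:
    """Conjunction of the three constraints in the motivating query."""
    return (
        expressed_in("brain")
        & human_orthologs_with_phenotype("abnormal learning/memory")
        & annotated_with("fatty acid metabolism")
    )


if __name__ == "__main__":
    print(example_query())
```

For brevity the sketch uses direct annotations only; a fuller version would also include genes annotated with child terms, as described in the previous section.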
Visualisation Challenges
When building complex queries, selecting ontology terms and relationships is one of the major bottlenecks.
Visualising large ontologies is not easy, and the process grows in complexity when formulating queries that involve
more than one ontology. Ontologies are complex directed acyclic graph (DAG) structures, and enabling users to
find terms within them that correspond to the query they have in mind is challenging.
One approach to finding ontology terms is ‘top-down browsing’ through the relationships within the ontology.
Ontology browsers, such as AmiGO [Ashburner et al. 2000], and ontology editors, such as DAGedit, accomplish
this using a collapsible tree, as used in some file browsers. Disadvantages of this approach are that the number of
children shown cannot be controlled, and that the overall representation can be overwhelmingly complex
and consume excessive screen real estate. Another system, Flamenco [Hearst et al. 2002], has been built specifically
for complex, multi-ontology queries, and enables top-down browsing of multiple ontologies simultaneously. Each
ontology is displayed in a box, with a ‘breadcrumb trail’ leading to the current term (enabling users to jump
back up the hierarchy), and a two-column list of child terms, with controls to expand or contract each box
dynamically. In this way, a large space of information is made accessible to the user without overwhelming
them or consuming screen space. Digr will use a similar approach, with adaptations for the display of DAGs,
as Flamenco was designed for tree-structured ontologies. In this way, users can decide if more specific, or more
general terms than the ones they have chosen, best capture their query.
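One consequence of moving from trees to DAGs is that a term may have several parents, and therefore several breadcrumb trails. The sketch below enumerates them for a toy hierarchy. The term names and the breadcrumb_trails function are hypothetical, and the source does not specify how Digr will adapt the display, so this shows only one possible reading of the problem.

```python
from typing import Dict, List

# Toy DAG (assumed): term -> ordered list of parent terms.
# "c" has two parents, which a tree-structured browser cannot represent.
PARENTS: Dict[str, List[str]] = {
    "root": [],
    "a": ["root"],
    "b": ["root"],
    "c": ["a", "b"],
}


def breadcrumb_trails(term: str, parents: Dict[str, List[str]]) -> List[List[str]]:
    """Return every path from a root term down to the given term."""
    term_parents = parents.get(term, [])
    if not term_parents:
        return [[term]]
    trails: List[List[str]] = []
    for parent in term_parents:
        for trail in breadcrumb_trails(parent, parents):
            trails.append(trail + [term])
    return trails


if __name__ == "__main__":
    for trail in breadcrumb_trails("c", PARENTS):
        print(" > ".join(trail))
```

An interface could show all such trails, or only the one the user actually followed; the number of trails can grow quickly in a deep DAG, so some selection is likely to be needed.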
Another approach to finding ontology terms is text searching. An example of this is Ontology Lookup Service
(OLS) [Côté et al. 2006], which uses a support vector machine (SVM) approach to enable extremely fuzzy text
searching with ranked results. OLS, however, only provides the names of matches for a user to choose from.
A richer, more informative view would better assist users in finding terms matching the concept they had in
mind. Flamenco also uses text searching, displaying results grouped by ontology terms, with options for choosing
different ontologies to group by. Digr will attempt to use the fuzzy search capabilities of OLS, but integrated
with the top-down components of the interface to provide a richer view of results.
Another useful technique in complex query construction is the provision of dynamic query previews, as carried
out in Flamenco [Hearst et al. 2002]. As the query is constructed, only the terms which can be combined with
those already selected are offered as choices to refine the query with. This reduces the complexity of finding terms, and can help
to show implicit relationships between them. In addition, the size of the result set to be returned is displayed
next to each combinable term, thus providing users with additional information to aid query construction. Digr
will attempt to provide both of these facilities.
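Both facilities can be sketched with a small in-memory example. The annotation data is invented, and the functions matching and previews are hypothetical; a real implementation would presumably issue queries to the database rather than scan a dictionary.

```python
from typing import Dict, Set

# Toy data (assumed): gene -> all terms it is annotated with, across ontologies.
ANNOTATIONS: Dict[str, Set[str]] = {
    "g1": {"brain", "fatty acid metabolism"},
    "g2": {"brain", "abnormal learning/memory"},
    "g3": {"liver", "fatty acid metabolism"},
}


def matching(selected: Set[str]) -> Set[str]:
    """Genes annotated with every term selected so far."""
    return {gene for gene, terms in ANNOTATIONS.items() if selected <= terms}


def previews(selected: Set[str]) -> Dict[str, int]:
    """For each term that can still be added, the size of the refined result.

    Terms that would give an empty result never appear, so only combinable
    terms are offered.
    """
    counts: Dict[str, int] = {}
    for gene in matching(selected):
        for term in ANNOTATIONS[gene] - selected:
            counts[term] = counts.get(term, 0) + 1
    return counts


if __name__ == "__main__":
    for term, count in sorted(previews({"brain"}).items()):
        print(f"{term} ({count})")
```

With 'brain' selected, 'liver' is not offered because no gene in the toy data carries both annotations, while each remaining term is shown with the number of genes the refined query would return.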
As an interface for complex query building, Digr aims to integrate text searching and top-down browsing in a
single interface, with dynamic query previewing. In addition, results will be visually clustered according to their
annotations to different ontologies, and all user actions will be undoable via a history-keeping mechanism. In
these ways, Digr hopes to visually aid biologists in constructing complex biological queries.
REFERENCES
Ashburner, M., Ball, C., Blake, J., Botstein, D., Butler, H., Cherry, J., Davis, A., Dolinski, K., Dwight, S., Eppig, J.,
Harris, M., Hill, D., Issel-Tarver, L., Kasarskis, A., Lewis, S., Matese, J., Richardson, J., Ringwald, M., Rubin, G.,
and Sherlock, G. 2000. Gene ontology: tool for the unification of biology. Nat Genet. 25(1), 25–29.
Blake, J., Eppig, J., Bult, C., Kadin, J., Richardson, J., et al. 2006. The mouse genome database (mgd): updates and
enhancements. Nucleic Acids Research 34, Database Issue.
Camon, E. et al. 2004. The gene ontology annotation (goa) database: sharing knowledge in uniprot with gene ontology. Nucleic
Acids Research 32, 90001, 262–266.
Card, S., Mackinlay, J., and Shneiderman, B. 1999. Readings in Information Visualization: Using Vision to Think. Morgan
Kaufmann.
Côté, R., Jones, P., Apweiler, R., and Hermjakob, H. 2006. The ontology lookup service, a lightweight cross-platform tool for
controlled vocabulary queries. BMC Bioinformatics 2006, 7, 97.
Drysdale, R., Crosby, M., Gelbart, W., Campbell, K., Emmert, D., et al. 2005. Flybase: genes and gene models. Nucleic
Acids Res 33, 390–395.
Etzold, T. and Argos, P. 1992. Srs–an indexing and retrieval tool for flat file data libraries. Bioinformatics 9, 49–57.
Galperin, M. Y. 2006. The molecular biology database collection: 2006 update. Nucl. Acids. Res. 34, D3–D5.
Garcia-Castro, A., Chen, Y., and Ragan, M. 2005. Information integration in molecular bioscience. Appl Bioinformatics 4, 3,
157–173.
Garcia-Castro, A., Thoraval, S., Garcia, L., and Ragan, M. 2005. Workflows in bioinformatics: meta-analysis and prototype
implementation of a workflow generator. BMC Bioinformatics 6, 87.
Goble, C., Stevens, R., Ng, G., Bechhofer, S., Paton, N., Baker, P., Peim, M., and Brass, A. 2001. Transparent access to
multiple bioinformatics information sources. IBM Systems Journal 40, 2, 532–551.
Hearst, M., Elliott, A., English, J., Sinha, R., Swearingen, K., and Yee, K. 2002. Finding the flow in web site search.
Communications of the ACM 45, 9, 42–49.
Kanz, C., Aldebert, P., Althorpe, N., Baker, W., Baldwin, A., Bates, K., Browne, P., van den Broek, A., Castro, M.,
Cochrane, G., et al. 2005. The embl nucleotide sequence database. Nucleic Acids Res 33, 167–172.
Lee, Y., Tsai, J., Sunkara, S., Karamycheva, S., Pertea, G., Sultana, R., Antonescu, V., Chan, A., Cheung, F., and
Quackenbush, J. 2005. The tigr gene indices: clustering and assembling est and known genes and integration with eukaryotic
genomes. Nucleic Acids Res 33, 71–74.
Letondal, C. 2001a. Interaction et programmation - conception d’applications programmables avec des non-informaticiens. Ph.D.
thesis, Université de Paris-Sud.
Letondal, C. 2001b. A web interface generator for molecular biology programs in unix. Bioinformatics 17, 1, 73–82.
Otgaar, D., Dominy, D., Maclear, A., Gamieldien, J., Martinez, F., and Jacobson, D. 2006. Digrabase: A graph-theoretic
framework for semantic integration of biological data. Poster, Joint BioLINK and 9th Bio-Ontologies Meeting.
Schuler, G., Epstein, J., Ohkawa, H., and Kans, J. 1996. Entrez: molecular biology database and retrieval system. Methods
Enzymol 266, 141–62.
Schwarz, E., Antoshechkin, I., Bastiani, C., Bieri, T., Blasiar, D., Canaran, P., Chan, J., Chen, N., Chen, W., Davis, P.,
et al. 2006. Wormbase: better software, richer content. Nucleic Acids Research.
Searls, D. 2005. Data integration: challenges for drug discovery. Nat Rev Drug Discov 4, 1, 45–58.
Shneiderman, B. 1996. The eyes have it: a task by data type taxonomy for information visualizations. Visual Languages, 1996.
Proceedings., IEEE Symposium on, 336–343.
Smith, C., Goldsmith, C., and Eppig, J. 2005. The mammalian phenotype ontology as a tool for annotating, analyzing and
comparing phenotypic information. Genome Biol 6, 1, R7.
Tao, Y., Liu, Y., Friedman, C., and Lussier, Y. 2004. Information visualization techniques in bioinformatics during the postgenomic
era. Drug Discovery Today BIOSILICO 2, 237–245.
Walenstein, A. 2002. Cognitive support in software engineering tools: A distributed cognition framework. Ph.D. thesis, SIMON
FRASER UNIVERSITY.
Wong, L. 2002. Technologies for integrating biological data. Briefings in Bioinformatics 3, 4, 389–404.