Syst biol 2012-burguiere-sysbio sys069

Software for Systematics and Evolution
Syst. Biol. 0(0):1–5, 2012
© The Author(s) 2012. Published by Oxford University Press, on behalf of the Society of Systematic Biologists. All rights reserved.
For Permissions, please email: journals.permissions@oup.com
DOI:10.1093/sysbio/sys069
IKey+: A New Single-Access Key Generation Web Service
THOMAS BURGUIERE∗, FLORIAN CAUSSE, VISOTHEARY UNG, AND RÉGINE VIGNES-LEBBE
Laboratoire Informatique et Systématique, Museum National d’Histoire Naturelle, Département Histoire de la Terre,
UMR 7207 – C.N.R.S. - M.N.H.N. - U.P.M.C., Paris, France
∗Correspondence to be sent to: Thomas Burguiere, Laboratoire Informatique et Systématique, M.N.H.N Département Histoire de la Terre,
Bâtiment de Géologie, CP48, 57 rue Cuvier, 75005 Paris, France,
E-mail: thomas.burguiere@upmc.fr
Thomas Burguiere and Florian Causse contributed equally to this article.
Received 2 April 2012; reviews returned 29 June 2012; accepted 25 July 2012
Associate Editor: David Posada
Abstract.—Single-access keys are a major tool for biologists who need to identify specimens. The construction process of these
keys is particularly complex (especially if the input data set is large) so having an automatic single-access key generation tool
is essential. As part of the European project ViBRANT, our aim was to develop such a tool as a web service, thus allowing
end-users to integrate it directly into their workflow.
IKey+ generates single-access keys on demand, for single users or research institutions. It receives user input data (using
the standard SDD format), accepts several key-generation parameters (affecting the key topology and representation), and
supports several output formats.
IKey+ is freely available (sources and binary packages) at www.identificationkey.fr. Furthermore, it is deployed on our
server and can be queried (for testing purposes) through a simple web client also available at www.identificationkey.fr
(last accessed 13 August 2012). Finally, a client plugin will be integrated into the Scratchpads biodiversity networking tool
(scratchpads.eu). [Systematics; taxonomy; single-access key; web service; biodiversity informatics.]
During the last decade, taxonomy has been going
through a revolution towards cybertaxonomy (Wheeler
2008). This is particularly obvious when looking at
the numerous initiatives dedicated to the management
and monitoring of biodiversity data such as Global
Biodiversity Information Facility (GBIF), European
Distributed Institute of Taxonomy (EDIT), BIOTA,
Catalogue of Life, Encyclopedia of Life (EOL) or Key
To Nature, and now ViBRANT. A significant increase
in digitized information should be expected in the near
future, and the deep social and scientific impact of web
2.0 stimulates the sharing of digital data and proposals
for cybertaxonomy projects to study global biodiversity
and climate change. In the current context of
distributed knowledge and long-distance collaborations,
the European project ViBRANT (http://vbrant.eu, last
accessed 13 August 2012) has emerged to bridge the
existing gap between tools and services, providing a
powerful exemplar platform, the Scratchpads, for the
entire community.
ViBRANT is about connecting the people, data and science of
biodiversity (ViBRANT - Objectives)
As studying and monitoring biodiversity remains
a strong challenge for biologists, descriptive data
management and identification tools are needed to assist
them in their daily work. For instance, we developed a
complete descriptive data management tool: Xper2 (Ung
et al. 2010).
Our contribution to ViBRANT is the implementation
of an identification key generation web service, because
we also have significant experience in developing key
generation tools such as MyKey (Gérard and
Vignes-Lebbe 2010). The most common identification keys
(described in Hagedorn et al. 2010) are the printed
single-access keys published in almost all taxonomic
revisionary work. But these printed keys may be difficult
to use while collecting samples in the field or when
consulting incomplete specimens. Hence, electronic
single-access keys, which can be quickly generated
with different parameters in order to bypass potentially
problematic steps (i.e., steps with characters for which
the specimen studied in the field cannot provide
information) are a good alternative.
APPROACH
Web Service
There are currently several tools designed to generate
single-access keys such as, among others, the Delta
software suite (Dallwitz et al. 1993), Pankey (Pankhurst
1988), and MyKey. These tools are well established and
produce valid identification keys, but are not without
drawbacks. For instance, both Delta and Pankey need to
be installed on the end-user’s machine and are not cross-
platform compatible (Windows, Apple, and Linux).
MyKey is a web application, which makes it
platform-agnostic, but the end-users cannot submit their own data
set to generate a key (the data set has to be manually
uploaded by someone in our laboratory). Furthermore,
these 3 tools do not use the biodiversity description
standard file format, Structured Descriptive Data (SDD)
(Hagedorn 2006), for data input.
Our objective, within the ViBRANT project, was to
design a free and open-source key generation service
using the SDD file format that can be used by
independent users or through the ViBRANT OBOE
service (also available through Scratchpads). It was
thus decided to develop a web service (implementing
both the SOAP and REST web service communication
protocols), which is particularly suitable for these two
use cases (cf. Fig. 1) and allows the client to build complex
workflows, with a high level of automation.
FIGURE 1. IKey+ use cases.
Systematic Biology Advance Access published September 16, 2012
Architecture
Because the main purpose of ViBRANT is to provide
a platform that is both open source and easily reusable,
we organized IKey+ in two parts:
• an Application Programming Interface (API)
consisting of 3 distinct modules: the Data Model,
that is, the computer representation of the
descriptive data; the Input-Output (IO) module,
which parses the SDD input files, loads the data
into the Model, and generates the output files; and
the Algorithm, which uses the data contained in
the Model to generate a key and returns it through
the IO module.
• a Simple Object Access Protocol (SOAP) or a
REpresentational State Transfer (REST) Service
Layer encapsulating the API, which manages the
communication between the client and the API.
This allows potential developers to integrate the service
into their workflows and adapt it to their specific needs
(e.g., integrating the API in standalone software). The
whole application was developed using the J2EE (Java 2,
Enterprise Edition) programming environment.
Web Service Input Parameters
IKey+ has several parameters available to the end-
user. These parameters allow the end-user to change the
topology of the key (e.g., pruning the key, promoting
some characters), change the visual aspect of the key, or
the output format. A complete list of these parameters
can be found in the user documentation, available here:
http://www.identificationkey.fr/resources/docs/identificationKeyGeneratorWS_UserGuide.pdf
(last accessed 13 August 2012).
Algorithm
The single-access key generation algorithm is a
recursive depth-first graph construction algorithm. Its
arguments are a list of taxa, taxaList, and the list
of characters considered, charList (cf. Supplementary
Materials, appendices 1, 2 and 3, available at
http://datadryad.org, doi:10.5061/dryad.3ft19). The resulting
graph (cf. Fig. 2) is a directed acyclic graph, consisting
of non-terminal nodes labelled with a character, edges
labelled with the states of the character of the previous
node, and terminal nodes labelled with taxa. At each
step of the algorithm, the best character among those
available is selected by the BEST_CHAR function, which
iterates over the available characters and returns the
character with the greatest discriminant power.
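The recursion described above can be illustrated with a minimal Python sketch. It is not the IKey+ implementation (which is written in Java and specified in the Supplementary Materials): the names build_key, best_char, Node, and the dictionary layout of the descriptions are assumptions made for this example, and the dpower function here is a simple stand-in that counts pairs of taxa sharing no state. Unlike the keys produced by IKey+, the sketch does not merge states leading to the same outcome.

```python
from dataclasses import dataclass, field
from itertools import combinations
from typing import Dict, FrozenSet, List, Optional

# Assumed layout: desc[taxon][character] is the set of states recorded for
# that taxon (a polymorphic taxon has several states).
Descriptions = Dict[str, Dict[str, FrozenSet[str]]]


@dataclass
class Node:
    character: Optional[str] = None               # None marks a terminal node
    taxa: List[str] = field(default_factory=list)
    children: Dict[str, "Node"] = field(default_factory=dict)  # state -> child


def dpower(char: str, taxa: List[str], desc: Descriptions) -> int:
    """Stand-in score: number of taxon pairs that share no state of `char`."""
    return sum(1 for a, b in combinations(taxa, 2)
               if not (desc[a][char] & desc[b][char]))


def best_char(chars: List[str], taxa: List[str], desc: Descriptions) -> str:
    """Return the character with the greatest discriminant power."""
    return max(chars, key=lambda c: dpower(c, taxa, desc))


def build_key(taxa_list: List[str], char_list: List[str],
              desc: Descriptions) -> Node:
    """Recursive depth-first construction of a single-access key."""
    if len(taxa_list) <= 1 or not char_list:
        return Node(taxa=list(taxa_list))
    best = best_char(char_list, taxa_list, desc)
    if dpower(best, taxa_list, desc) == 0:        # nothing left to separate
        return Node(taxa=list(taxa_list))
    node = Node(character=best, taxa=list(taxa_list))
    remaining = [c for c in char_list if c != best]
    states = sorted(set().union(*(desc[t][best] for t in taxa_list)))
    for state in states:                          # one edge per observed state
        subset = [t for t in taxa_list if state in desc[t][best]]
        node.children[state] = build_key(subset, remaining, desc)
    return node


# Toy data set (invented for illustration).
desc = {
    "Taxon 1": {"leaf": frozenset({"simple"}), "flower": frozenset({"red"})},
    "Taxon 2": {"leaf": frozenset({"compound"}), "flower": frozenset({"red"})},
    "Taxon 3": {"leaf": frozenset({"compound"}), "flower": frozenset({"white"})},
}
key = build_key(sorted(desc), ["leaf", "flower"], desc)
```

In this toy case both characters separate two pairs of taxa, so the first one in the list ("leaf") is chosen at the root, and "flower" then separates the two remaining taxa.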
Discriminant Power of a Character
During the last 50 years, estimating a character’s
discriminant power has been the central point of key-
construction algorithms. Many measurement methods
have been suggested, and a review of these methods can
be found in Gower and Payne (1975), Pankhurst (1991),
and Delgado-Calvo-Flores et al. (2006). In a statistical
context, examples include the Bayesian probability and
generalized entropy, such as the Shannon entropy used
in the ID3 algorithm developed by Quinlan (1986) or the
Gini index used in Breiman et al. (1984). In a context
where there is no probability associated with the states
of the characters, one can use the separation coefficient
or the variance as measurement methods of a character’s
discriminant power.
In IKey+, the discriminant power of a given character
C is calculated by the DPOWER function and is
an estimate of C’s ability to differentiate the taxa
of the current list. DPOWER is an extension of the
Gyllenberg separation factor (Gyllenberg 1963) and
is a measurement of the number of pairs of taxa
discriminated by C, which is a generalization of the
variance. Indeed, the variance of a variable can be
computed by comparing all pairs of values (here,
comparing all pairs of taxon descriptions for C). With
different comparison functions, each adapted to a certain
type of character (e.g., a categorical or a numerical
character), it is possible to obtain a generalized formula
to estimate the discriminant power of a given character,
even with polymorphic characters or characters with
missing data. For categorical characters, DPOWER can
use a binary function (boolean comparison) or other
functions such as the Sokal and Michener coefficient
(Sokal and Michener 1958) or the Jaccard coefficient
(Jaccard 1901). For numerical characters (e.g., a size
measured in millimetres), the set of values (i.e., all the
values entered for C for all taxa) is split into two intervals.
In order to determine the threshold value separating
these two intervals, we consider the list of min and max
values of C for each of the remaining taxa. We then choose
from this list the value that separates the remaining
taxa into two groups of equal size (±1 taxon). These
two intervals are considered as 2 discrete states for the
calculation of the discriminant power.
FIGURE 2. Formal identification key example.
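The threshold selection for numerical characters can be sketched as follows. This is one possible reading of the description, not the IKey+ code: the function names split_threshold and interval_states are hypothetical, and the sketch assumes that a taxon counts as belonging to the lower group when its whole range lies at or below the candidate threshold.

```python
from typing import Dict, Set, Tuple


def split_threshold(ranges: Dict[str, Tuple[float, float]]) -> float:
    """Pick, among all min and max values, the one that best balances the taxa.

    `ranges` maps each remaining taxon to its (min, max) for character C.
    Assumption: a taxon is in the lower group if its max is <= threshold.
    """
    candidates = sorted({v for lo, hi in ranges.values() for v in (lo, hi)})
    n = len(ranges)

    def imbalance(t: float) -> int:
        below = sum(1 for _, hi in ranges.values() if hi <= t)
        return abs(below - (n - below))

    # min() keeps the first (smallest) candidate among equally good ones.
    return min(candidates, key=imbalance)


def interval_states(lo: float, hi: float, threshold: float) -> Set[str]:
    """Map a numerical range onto the two discrete states used for scoring."""
    states = set()
    if lo <= threshold:
        states.add("low")
    if hi > threshold:
        states.add("high")
    return states


# Invented measurements (e.g., sizes in millimetres).
ranges = {"A": (1.0, 2.0), "B": (1.5, 3.0), "C": (4.0, 6.0), "D": (5.0, 7.0)}
threshold = split_threshold(ranges)
```

With these invented ranges, the chosen threshold puts A and B in one group and C and D in the other. A taxon whose range straddles the threshold receives both states and is therefore treated like a polymorphic taxon.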
DPOWER iterates over the available taxa (i.e., those
that are compatible with the current description), and
determines, for each pair of taxa (Ta and Tb), the
dissimilarity (i.e., the possibility of discriminating Ta and
Tb with C), which is based on the number of common states
of C for Ta and Tb (n11), the number of states of C that
occur only for Ta (n10), the number of states of C that
occur only for Tb (n01), and the number of states of C that
occur for neither Ta nor Tb (n00). The discriminant
power is then calculated as
$$\mathrm{DP} = \sum_{a} \sum_{b} \mathrm{SCORE}(n_{11}, n_{10}, n_{01}, n_{00}).$$
Three different SCORE measurements are currently
available in IKey+, the Xper coefficient (Ung et al. 2010;
Vignes et al. 1989), the Jaccard coefficient, and the Sokal
and Michener coefficient.
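As an illustration of how such a score could be computed for a categorical character, the sketch below uses the Jaccard and the Sokal and Michener coefficients. Several points are assumptions of this example rather than statements about IKey+: both coefficients are turned into dissimilarities as 1 minus the similarity, the sum runs over unordered pairs of taxa, missing data are not handled, and the Xper coefficient is left out because its formula is not given in the text. All function names are hypothetical.

```python
from itertools import combinations
from typing import Callable, Dict, Set, Tuple


def counts(states_a: Set[str], states_b: Set[str],
           all_states: Set[str]) -> Tuple[int, int, int, int]:
    """Return (n11, n10, n01, n00) for two taxa and one character."""
    n11 = len(states_a & states_b)               # states common to Ta and Tb
    n10 = len(states_a - states_b)               # states only in Ta
    n01 = len(states_b - states_a)               # states only in Tb
    n00 = len(all_states - states_a - states_b)  # states in neither
    return n11, n10, n01, n00


def jaccard_score(n11: int, n10: int, n01: int, n00: int) -> float:
    """Assumed form: 1 - Jaccard similarity (n00 is ignored)."""
    denom = n11 + n10 + n01
    return 0.0 if denom == 0 else 1.0 - n11 / denom


def sokal_michener_score(n11: int, n10: int, n01: int, n00: int) -> float:
    """Assumed form: 1 - simple matching similarity."""
    total = n11 + n10 + n01 + n00
    return 0.0 if total == 0 else 1.0 - (n11 + n00) / total


def dpower(descriptions: Dict[str, Set[str]], all_states: Set[str],
           score: Callable[[int, int, int, int], float]) -> float:
    """Sum SCORE over all unordered pairs of taxa for one character."""
    return sum(score(*counts(descriptions[a], descriptions[b], all_states))
               for a, b in combinations(sorted(descriptions), 2))


# Invented descriptions of one character with three states.
all_states = {"red", "white", "blue"}
descriptions = {"T1": {"red"}, "T2": {"red", "white"}, "T3": {"blue"}}
dp_jaccard = dpower(descriptions, all_states, jaccard_score)
dp_sokal = dpower(descriptions, all_states, sokal_michener_score)
```

The polymorphic taxon T2 is only partly separated from T1, so that pair contributes less than the fully separated pairs. The two coefficients differ in whether the states absent from both taxa (n00) count as agreement.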
BENCHMARKS
Tests
In order to assess the performance of our web
service, we conducted a series of tests using the
same input file for every test. This file contains a
data set representing the tribe Cichorieae of the
plant family Asteraceae with 303 observable characters,
144 taxa, and their descriptions. This data set was
generated and curated as an exemplar group for the
EDIT project (Hand et al. 2009) [the file is available
here: http://www.infosyslab.fr/vibrant/project/test/Cichorieae-fullSDD.xml
(last accessed 13 August 2012)].
The aim of these tests was to evaluate the overall
performance of the algorithm, to assess the impact
of the various parameters available to the end-user
on performance, and to ensure that IKey+ would be
robust in a production environment. We measured the
time necessary for the web service to respond to 100
identification key generation queries, while counting
the number of rejected queries due to CPU overload
(IKey+ rejects any query received when the CPU load
of the host server is greater than 80%). We ran one
reference test (described in Supplementary Materials,
appendix 4, doi:10.5061/dryad.3ft19), and several tests
with variations on the input parameters, the web service
communication protocol used or the parallelization
setups. The complete list of performance test setups
is available in Supplementary Materials, appendix 5,
doi:10.5061/dryad.3ft19. We also measured the time
necessary to generate a single key, using the reference test
configuration described in Supplementary Materials,
appendix 4, doi:10.5061/dryad.3ft19. Finally, we tested
the usability of IKey+ from an end-user perspective,
when generating the Cichorieae identification key using
the web interface. We asked 11 people who were not
involved in the development of IKey+ to test the web
service and the web interface. They were given the
SDD-formatted Cichorieae data set and were asked to
use the web interface to create the identification key.
We measured the time needed by each test subject to
generate the identification key.
Results
The average length of the paths leading to taxa
that were identified by the algorithm is 4.67 steps,
with the shortest path being 1 step long, and the
longest being 10 steps long. The generation of a single
key took roughly 1.8 s. The results of the other tests
are shown in Supplementary Materials, appendix 6,
doi:10.5061/dryad.3ft19. The reference test took roughly
50 s to complete, that is, 500 ms per query, which is
consistent with the time measured for the generation
of a single key, because the reference test uses 4
simultaneous threads to generate the keys. In this test,
few queries were rejected due to CPU overload. Among
the parameters available to the end-user, only the score
method parameter had a significant influence: when
using the Sokal and Michener score method or the
Jaccard score method, the tests took longer to finish
(∼80 s instead of 50 s). The score method parameter had
no impact on the number of rejected queries. Our tests
showed that the communication protocol used (SOAP or
REST) had no influence on the performance of IKey+.
Some taxa do not appear in the generated key because of
insufficient data in the input file. The Cichorieae
data set we used for our tests was created by an
external team, and some errors were made
during its creation: some taxa
with unknown data were not explicitly marked as
“unknown data”, but were left unspecified instead.
Although these ambiguously coded taxa do
not appear in the resulting identification key by default,
the end-user can choose to have them appear in the key.
When modifying the parallelization of the queries,
we observed significant performance variations. As can
be expected, when launching 100 queries sequentially
(instead of using 4 simultaneous threads launching 25
queries each), the test was 4 times longer, and no queries
were rejected. When we increased the parallelism
of the queries (e.g., 25 or 100 simultaneous queries),
more queries were rejected (up to 50%). However, when
launching 100 simultaneous queries, with a random
delay at the beginning of each thread, the results (both in
time and number of rejected queries) were comparable to
the performance of the reference test. In the usability test,
the average time needed to generate the identification
key was slightly above 60 s (60.36 s), with the shortest
time measured at 26 s, and the longest time measured
at 121 s. The complete results of the usability test
are available in Supplementary Materials, appendix 7,
doi:10.5061/dryad.3ft19.
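A load test of the kind described above (several threads each launching consecutive queries, optionally with a random start delay) could be scripted along the following lines. This is a sketch under stated assumptions, not the test harness used for the benchmarks: run_benchmark and its parameters are hypothetical names, and the query argument stands for any callable that sends one key generation request and returns True when the service accepts it. Here it is replaced by a local stub so that the example runs without a server.

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Tuple


def run_benchmark(query: Callable[[], bool], n_threads: int = 4,
                  queries_per_thread: int = 25,
                  max_start_delay: float = 0.0) -> Tuple[float, int, int]:
    """Return (elapsed seconds, accepted queries, rejected queries)."""

    def worker(_: int) -> Tuple[int, int]:
        # Optional random delay so that threads do not all start at once.
        time.sleep(random.uniform(0.0, max_start_delay))
        accepted = rejected = 0
        for _ in range(queries_per_thread):
            if query():
                accepted += 1
            else:
                rejected += 1
        return accepted, rejected

    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        results = list(pool.map(worker, range(n_threads)))
    elapsed = time.perf_counter() - start
    return (elapsed,
            sum(a for a, _ in results),
            sum(r for _, r in results))


def fake_query() -> bool:
    """Stub standing in for a real request to the web service."""
    time.sleep(0.001)
    return True


elapsed, accepted, rejected = run_benchmark(fake_query)
```

The default arguments mirror the reference setup of 4 threads launching 25 queries each. Setting n_threads to 100, queries_per_thread to 1, and a non-zero max_start_delay would correspond to the staggered variant.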
DISCUSSION
IKey+ is available and can be installed on any J2EE
application server (e.g., Apache Tomcat). It can generate
single-access keys using a tree or a flat representation
in several output formats (HTML, Wiki, SDD, plain-
text, etc.).
Our tests showed that IKey+ performs sufficiently well
to handle a large data set in a relatively short amount of
time and can generate well-optimized key files (average
number of steps to identify a taxon: 4.67). This, combined
with the web service accessibility, makes it possible to
integrate IKey+ in many workflows that might require
a fast and automated key generation process (e.g., a batch
key generation script). Furthermore, an end-user can use
the web interface available at www.identificationkey.fr
to quickly generate a customized identification key,
using the numerous parameters available (affecting the
topology, representation, file format, etc.).
Finally, our tests showed that IKey+ is likely to
be robust in a production environment (i.e., many
simultaneous queries) as it is able to withstand
simultaneous key generation queries (e.g., 4 threads
launching 25 consecutive queries). It is also protected
from cryptic failure, because we implemented a CPU-
load-watching mechanism that automatically rejects a
query (with an explicit error message) whenever the
CPU load exceeds a given threshold (80%). This prevents
a crash of the service, or the generation of incomplete or
corrupt key files.
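The principle of this protection can be sketched in a few lines. The code below is an illustration only and does not reproduce the IKey+ mechanism: the names guarded and ServerBusyError are hypothetical, and the way the CPU load is measured is left to the caller through the read_cpu_load argument, which is assumed to return a fraction between 0 and 1.

```python
from typing import Callable


class ServerBusyError(RuntimeError):
    """Raised when a query is rejected because the host is overloaded."""


def guarded(generate_key: Callable[[str], str],
            read_cpu_load: Callable[[], float],
            threshold: float = 0.80) -> Callable[[str], str]:
    """Wrap a key generation function with a CPU-load check."""

    def handler(request: str) -> str:
        load = read_cpu_load()
        if load > threshold:
            # Reject early with an explicit message instead of risking a
            # crash or an incomplete key file.
            raise ServerBusyError(
                f"CPU load {load:.0%} exceeds {threshold:.0%}; "
                "query rejected, please retry later")
        return generate_key(request)

    return handler


# Stubs standing in for the real key generator and load probe.
idle_service = guarded(lambda req: f"key for {req}", lambda: 0.35)
busy_service = guarded(lambda req: f"key for {req}", lambda: 0.95)
result = idle_service("Cichorieae")
try:
    busy_service("Cichorieae")
    was_rejected = False
except ServerBusyError as err:
    was_rejected = True
    message = str(err)
```

Checking the load before starting the computation means a rejected client gets an immediate, explicit answer and can retry, which is consistent with the behaviour observed when queries were staggered with a random delay.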
CONCLUSION
IKey+ is the first key-generation tool available as
a web service with standardized input and output
formats. Our tests showed that IKey+ is able to generate
keys rapidly and that it can also be used by an end-user
with the web interface. Finally, the modular and
open-source nature of IKey+ makes it possible for anyone to
reuse its components. For instance, we plan to reuse some
components of the API to develop another web service
that would provide free-access key identification.
LICENSING
As part of the ViBRANT project, IKey+’s source
code is freely available and is licensed under the GNU
General Public License version 2. It is already available
on our Google Code SVN repository:
http://ikey-plus.googlecode.com/svn/trunk/ (last accessed 13 August
2012). It will be actively maintained by our team for the
next 2 years.
SUPPLEMENTARY MATERIAL
Supplementary material, including Algorithms and
appendices, can be found at http://www.sysbio
.oxfordjournals.org.
FUNDING
This work was supported by the European Union
funded FP7 ViBRANT Project (Contract number RI-
261532, Period, December 2010 to November 2013).
ACKNOWLEDGEMENTS
We sincerely thank Gregor Hagedorn (Julius Kühn
Institute, Berlin, Germany) and Andreas Müller
(Botanical Garden and Botanical Museum, Berlin,
Germany) for sharing their knowledge on the SDD
format. We are also grateful to Dave Roberts (Natural
History Museum, London, UK) for reviewing an early
version of the article and providing style improvements.
REFERENCES
BIOTA. Available from: URL http://www.edinburgh.ceh.ac.uk/
biota/ (last accessed 13 August 2012).
Breiman L., Friedman J.H., Olshen R.A., Stone C.J. 1984. Classification
and regression trees. Belmont, CA: Wadsworth International Group.
Catalogue of Life. Available from: URL http://www.catalogueoflife.
org/ (last accessed 13 August 2012).
Dallwitz M.J., Paine T.A., Zurcher E.J. 1993. User’s guide to the delta
system: a general system for coding taxonomic descriptions. 4th
ed. Available from: URL http://delta-intkey.com (last accessed
13 August 2012).
Delgado-Calvo-Flores M., Fajardo-Contreras W., Gibaja-Galindo E.L.,
Perez-Perez R. 2006. Xkey: a tool for the generation of identification
keys. Expert Syst. Appl. 30:337–351.
EDIT. European Distributed Institute of Taxonomy. Available from:
URL http://www.e-taxonomy.eu/ (last accessed 13 August 2012).
EOL. Encyclopaedia of Life. Available from: URL http://eol.org/.
GBIF. Global Biodiversity Information Facility. Available from: URL
http://www.gbif.org/ (last accessed 13 August 2012).
Gérard D., Vignes-Lebbe R. 2010. Mykey: a server-side software to
create customized decision trees. In: Nimis P.L., Vignes-Lebbe R.,
editors. Tools for identifying biodiversity: progress and problems.
Edizioni Università di Trieste, Trieste, Italy. p. 107–112.
Gower J.C., Payne R.W. 1975. A comparison of different criteria
for selecting binary tests in diagnostic keys. Biometrika
62:665–672.
Gyllenberg H.G. 1963. A general method for deriving determinative
schemes for random collections of microbial isolates. Ann. Acad.
Scient. Fenn. Ser. A IV. Biologica 1(69):1–23.
Hagedorn G. 2006. The structured descriptive data (SDD) w3c-xml-
schema. Version 1.1 Available from: URL http://wiki.tdwg.org/
twiki/bin/view/SDD/Version1dot1 (last accessed 13 August 2012).
Hagedorn G., Rambold G., Martellos S. 2010. Types of identification
keys. In: Nimis P.L., Vignes-Lebbe R., editors. Tools for identifying
biodiversity: progress and problems. Edizioni Università di Trieste,
Trieste, Italy. p. 59–64.
Hand R., Kilian N., Raab-Straube E. 2009. International cichorieae
network: Cichorieae portal. Available from: URL http://wp6-
cichorieae.e-taxonomy.eu/portal/ (last accessed 13 August 2012).
Jaccard P. 1901. Étude comparative de la distribution florale dans une
portion des alpes et des jura. Bull. Soc. Vaud. Sci. Nat. 37:547–579.
J2EE. Java 2, Enterprise Edition. Available from: URL http://www.
oracle.com/technetwork/java/javaee/overview/index.html (last
accessed 13 August 2012).
KeyToNature. Available from: URL http://www.keytonature.
eu/wiki/ (last accessed 13 August 2012).
Pankhurst R.J. 1988. Pankey programs. DELTA Newsletter 1:2.
Pankhurst R.J. 1991. Practical taxonomic computing. Cambridge
University Press, Cambridge, UK.
Quinlan J.R. 1986. Induction of decision trees. Mach. Learn. 1:81–
106. ISSN 0885-6125. Available from: URL http://dx.doi.org/
10.1007/BF00116251 (last accessed 13 August 2012).
REST Architecture. Available from: URL http://www.oracle.com/
technetwork/articles/javase/index-137171.html (last accessed
13 August 2012).
Scratchpads. Biodiversity Online. Available from: URL http://
scratchpads.eu/ (last accessed 13 August 2012).
SOAP. Simple Object Access Protocol, W3C Recommendation. Version
1.2. Available from: URL http://www.w3.org/TR/soap/ (last
accessed 13 August 2012).
Sokal R., Michener C. 1958. A statistical method for evaluating
systematic relationships. Univ. Kansas Sci. Bull., (38):1409–1438.
Ung V., Dubus G., Zaragüeta-Bagils R., Vignes-Lebbe R. 2010. Xper2:
introducing e-taxonomy. Bioinformatics 26(5):703–704.
ViBRANTa. Objectives. Available from: URL http://vbrant.eu/
node/1 (last accessed 13 August 2012).
ViBRANTb. Virtual Biodiversity Research and Access Network for
Taxonomy. Available from: URL http://vbrant.eu (last accessed
13 August 2012).
Vignes R., Lebbe J., Darmoni S. 1989. Symbolic-numeric approach
for biological knowledge representation: a medical example with
creation of identification graphs. In: Diday E., editor. Proceedings
of the conference on Data analysis, learning symbolic and numeric
knowledge. Nova Science Publishers, Inc., Commack, NY, USA. ISBN
0-941743-64-0. p. 389–398.
Wheeler, Q.D., editor 2008. The new taxonomy. CRC Press Inc. New
York, USA.
