A Case Study of a Reusable Component Collection
William B. Frakes
Computer Science Department
Virginia Tech, Falls Church
wfrakes@vt.edu
Abstract
This paper reports on practical issues in the
development, distribution, use, and evolution of a
reusable component collection in the domain of
information retrieval.
1. Introduction
Software reuse is the use of existing software
knowledge or artifacts to build new software. There are
many types of software reuse [9]. The reuse described in
this paper is ad-hoc, black box, compositional, code
reuse. Ad hoc means that the reuse is not part of a
repeatable mandated organizational process. Ad hoc reuse
is by far more common than systematic reuse, though the
latter is thought to be more powerful. Black box reuse is
reuse of a software item without modification.
Compositional reuse means that the software system was
built by a human programmer out of components, as
opposed to generating a system automatically from
specifications. The reuse described in this paper is
primarily vertical rather than horizontal since it is focused
in the domain of information retrieval, though some of
the components such as string searching might also be
considered horizontal.
One source of reusable software is the code that is
developed to accompany books. This paper concerns code
from a book on data structures and algorithms for
information retrieval (IR) systems [6]. Information
retrieval systems retrieve textual documents from
a database in response to queries submitted to
the system by users.
IR systems can be defined more formally using set and
function notation as follows.
D = set of textual documents
D' = subset of D
Q = set of queries
M = matching function
Systems in the domain of information retrieval can
now be specified as follows.
S : S computes D' = M(D, Q)
That is, all systems S such that S returns a subset of
documents D' of D that match the set of queries Q are IR
systems.
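As a minimal sketch of this definition, the following C fragment computes D' for a single query. It assumes the simplest possible matching function, plain substring containment; the names `matches` and `retrieve` are hypothetical and are not taken from the book's code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical matching function M for one document and one query:
 * the document matches when it contains the query term as a substring. */
static int matches(const char *document, const char *query)
{
    assert(document != NULL && query != NULL);
    return strstr(document, query) != NULL;
}

/* Computes D' = M(D, Q) for a single-term query. Writes the indices of
 * the matching documents into result (capacity n) and returns how many
 * documents matched. */
size_t retrieve(const char *const *docs, size_t n, const char *query,
                size_t *result)
{
    size_t count = 0;

    assert(docs != NULL && result != NULL);
    for (size_t i = 0; i < n; i++) {
        if (matches(docs[i], query)) {
            result[count++] = i;
        }
    }
    return count;
}
```

A real IR system would replace the substring test with indexing, stemming, and ranking, but the shape of the computation, a subset of D selected by M, stays the same.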
One of the goals for the book was development of
reusable IR code. Authors were asked to develop software
components for their chapters in C following industrial
coding guidelines. This was partly successful, and with
some rework, the following components were developed
and tested:
• Lexical Analysis and Stop List operations - this code
breaks text into words and removes words considered
unimportant for indexing.
• Stemmer Code - implements the Porter stemming
algorithm. A stemmer conflates words by finding a
common root form of the words.
• Thesaurus Construction - supports the automatic
construction of thesauri from source text.
• Boolean Operations - implements standard Boolean
operations (AND, OR, NOT) on sets of documents.
• Hashing Algorithms - including an algorithm for
minimal perfect hashing.
• String searching - implementations for basic
algorithms for finding patterns in text strings.
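To give a flavor of one item in this list, the Boolean operations can be sketched as merges over sorted arrays of document identifiers. This is an illustration under stated assumptions, not the book's implementation: it assumes each set is an ascending, duplicate-free array of `int` identifiers, treats NOT as set difference, and uses the hypothetical names `docset_and`, `docset_or`, and `docset_andnot`.

```c
#include <assert.h>
#include <stddef.h>

/* AND: intersection of two ascending, duplicate-free arrays of document
 * IDs. out needs capacity for min(na, nb) entries. Returns the length. */
size_t docset_and(const int *a, size_t na, const int *b, size_t nb, int *out)
{
    size_t i = 0, j = 0, k = 0;

    assert(a != NULL && b != NULL && out != NULL);
    while (i < na && j < nb) {
        if (a[i] < b[j]) {
            i++;
        } else if (a[i] > b[j]) {
            j++;
        } else {
            out[k++] = a[i];
            i++;
            j++;
        }
    }
    return k;
}

/* OR: union of the two sets. out needs capacity for na + nb entries. */
size_t docset_or(const int *a, size_t na, const int *b, size_t nb, int *out)
{
    size_t i = 0, j = 0, k = 0;

    assert(a != NULL && b != NULL && out != NULL);
    while (i < na && j < nb) {
        if (a[i] < b[j]) {
            out[k++] = a[i++];
        } else if (a[i] > b[j]) {
            out[k++] = b[j++];
        } else {
            out[k++] = a[i];
            i++;
            j++;
        }
    }
    while (i < na) out[k++] = a[i++];
    while (j < nb) out[k++] = b[j++];
    return k;
}

/* a AND NOT b: documents in a that are not in b. out needs capacity na. */
size_t docset_andnot(const int *a, size_t na, const int *b, size_t nb, int *out)
{
    size_t j = 0, k = 0;

    assert(a != NULL && b != NULL && out != NULL);
    for (size_t i = 0; i < na; i++) {
        while (j < nb && b[j] < a[i]) j++;      /* skip smaller IDs in b */
        if (j >= nb || b[j] != a[i]) out[k++] = a[i];
    }
    return k;
}
```

Keeping the arrays sorted lets each operation run in one linear pass, which is one common design choice for this kind of component.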
This paper is about practical issues encountered in the
creation, distribution, and use of the components. These
issues are not particular to the domain of information
retrieval, nor particular to C functions. They may well
arise in any domain and for any type of reusable asset.
0-7695-0559-7/00 $10.00 © 2000 IEEE
2. What is a component?
What is a component? The term is ambiguous. A
component can be any lifecycle object or part thereof.
Usually a code component is a subroutine (function or
subprogram), or an object or class, but it could also be
many other things like macros, header files, subsystems,
processes, or patterns. This paper discusses collections of
C functions. This simplifies things a little since this is a
kind of reuse familiar to many. Even this kind of reuse,
however, can still be complicated.
The 3 C’s model of reuse design [12] says that there
are three aspects of a reusable component--the concept, the
content, and the context. The concept corresponds to the
abstract functionality of a component such as might be
specified in an abstract data type or a formal algorithm
specification. The purpose of such abstractions is to focus
on the essence of the component, whatever that might be,
and ignore other details, usually implementation details.
The chapters in the book provide the specifications for the
concepts of the components.
The content is the implementation of the component.
This involves selection of a programming language and a
design. The implementations of the components in C are
the content. The transition from concept to content
involves moving from the problem, or domain, space to
the solution space. The problem space is only concerned
with the concepts and operations of the domain in
question--in this case information retrieval. The solution
space involves the concepts and operations of the
implementation environment--in this case the C language.
The context is the environment needed to use the
components. Context for code components might be the
required machine, operating systems, compiler version,
and so on. The code for the IR components was
developed for and tested on a Unix system and certain
assumptions were made regarding implementation.
Porting the code to DOS, for example, required changes
to make filenames have the required length of no more
than eight characters.
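A context assumption of this kind can be checked mechanically. The sketch below is hypothetical and not part of the IR components: it tests whether a file name fits the DOS length limits, assuming the usual 8.3 convention (a base name of at most eight characters and an extension of at most three). It checks lengths only, not which characters DOS allows.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { DOS_BASE_MAX = 8, DOS_EXT_MAX = 3 };

/* Returns 1 if name (without any directory part) fits the DOS 8.3 length
 * limits, 0 otherwise. The extension is whatever follows the last '.'. */
int fits_dos_name(const char *name)
{
    const char *dot;
    size_t base_len, ext_len;

    assert(name != NULL);
    dot = strrchr(name, '.');
    if (dot == NULL) {
        base_len = strlen(name);
        ext_len = 0;
    } else {
        base_len = (size_t)(dot - name);
        ext_len = strlen(dot + 1);
    }
    return base_len >= 1 && base_len <= DOS_BASE_MAX && ext_len <= DOS_EXT_MAX;
}
```

Run over a source directory before a port, a check like this would flag a (made-up) file such as `thesaurus.c`, whose nine-character base name is one too long.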
3. Language
Software reuse is now generally regarded to be a good
thing, and most modern languages make some claim for
their support of reuse. The C language, for example, was
designed for extensive reuse in the sense that it is a small
language extensively augmented by reusable function
libraries. Newer languages like C++ provide reuse of
higher level programming constructs such as objects,
classes, and templates, and directly support type
polymorphism via function overloading. A summary of
the reuse aspects of C++, for example, can be found in
[14].
I selected C as the component implementation
language because C was and is a widely known and used
language in both industry and academia. It is also the
programming language I know best, and the one I’ve used
to develop industrial software. There are also many good
free software engineering support tools for C, including
free compilers. Was C the best choice? This of course
opens the door to language lawyering. Let me just say
that the components got developed.
Some of the components have been rewritten in other
languages, sometimes with attribution of the source,
sometimes not. Versions of the stemmer, for example,
have appeared on the web in Perl and Java.
4. Source or Binary?
The argument is sometimes made that only the
executable code for reusable components should be
distributed, not the source code. The reasoning here is
that distributing source code means that it will be
modified which will break the design abstraction, thus
losing much of the reuse benefit. Executable distribution
could be done in C by making and distributing archive
files containing object code for the functions. This
assumes that all of the users will have an environment
where the archives can be used.
Distributing only executable code may be a good idea
if the user of the components can be assured that someone
is available to fix problems and make enhancements as
needed. With software such as the IR components there
was no readily available maintenance organization, so we
distributed the source.
5. Testing and Optimization
The quality assurance of software is important to its
reuse. Code that does not meet the software quality
standards of an organization will not be reused by the
organization. Inside an organization, thorough testing and
optimization of components can be justified since the
higher costs for these activities can be amortized across
the multiple reuses.
Before release, the code was inspected for conformance
to programming standards, such as the use of standard
headers on code modules, run through lint, and
tested with coverage analysis to 90% branch coverage. Code
portability was checked by moving the code to another
environment.
A rule of thumb sometimes used by designers of
reusable components is that if a reusable component is
more than 25% slower than an equivalent one-use
component, it will not be reused. Optimization of code
components can, therefore, be important. Optimizations
must be done carefully, since increased optimization often
decreases code readability and maintainability. Bentley
provides a good summary of proper techniques [3]. For the
IR components, however, no systematic optimization was
done, nor have there been any requests for it from users.
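The 25% rule of thumb is simple arithmetic once both run times are known. The sketch below assumes the two times have already been measured for the same workload (for example with `clock()` from `<time.h>`); the function names and the `MAX_SLOWDOWN` constant are hypothetical.

```c
#include <assert.h>

/* Rule-of-thumb threshold from the text: at most 25% slower. */
#define MAX_SLOWDOWN 0.25

/* Fractional slowdown of the reusable component relative to the one-use
 * component: 0.10 means 10% slower, a negative value means faster. */
double slowdown(double reusable_secs, double one_use_secs)
{
    assert(one_use_secs > 0.0);
    return (reusable_secs - one_use_secs) / one_use_secs;
}

/* Returns 1 if the reusable component stays within the 25% budget. */
int within_reuse_budget(double reusable_secs, double one_use_secs)
{
    return slowdown(reusable_secs, one_use_secs) <= MAX_SLOWDOWN;
}
```

For instance, a reusable component that takes 1.2 seconds where a one-use version takes 1.0 second is 20% slower and passes; one that takes 1.5 seconds is 50% slower and does not.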
6. Delivery Methods
A key question with a component collection is how to
make the components available. The first plan for the IR
components was for a disk to be included with the book,
but for various logistical reasons that didn’t work. So,
plan two was to make the code available via ftp. I put the
code on a Virginia Tech ftp site (ftp.vt.edu), originally
storing the code for each chapter’s sub-collection in a
separate directory. I started
getting requests to put the code in a single file to make
downloading easier. I did that by creating a compressed
tar file and putting that on the web site. Then I started
getting email from people outside the U.S. saying that
they couldn’t get into the ftp site. I referred them to the
ftp site technician. I think that people usually got the
code they wanted, but the problem persisted. I decided to
put the components into software repositories as well.
In the 1990s the U.S. government started supporting
research and development of reuse repositories. Two such
were Asset and Mountainnet. I submitted the IR
components to both libraries. Submission of components
to the library required that I fill out a template describing
the components. The components were available in these
repositories for several years. Government funding for the
repositories was stopped in 1998, and the repositories are
now no longer available.
In 1994, Prentice-Hall licensed the book to Dr. Dobb’s,
which created a CD-ROM containing the IR book and
several other algorithms books [5]. The text of the book,
code included, was put into a hypertext format and a
search engine was included.
Many other web sites now either reference the IR code
ftp site, or keep a copy of the code. There is, however, no
mechanism for keeping consistency among web sites
offering the code. This is a version control problem (see
maintenance section below).
GNU (GNU’s Not Unix) is a collection of software
managed by the Free Software Foundation [10]. While
examining the holdings of the GNU library, I saw they
had nothing on IR. I contacted the Free Software
Foundation offering the IR components. After several
email exchanges, the following facts emerged.
1. GNU would like to have the code.
2. Some rework of the code would be necessary to put
the code into the GNU format.
3. Having the code in GNU would require a
commitment to long term maintenance (see
discussion of maintenance below).
4. Putting the code in GNU would require that the code
meet the GNU standards for free software. This
requires, among other things, that the code in the
GNU library make no reference to the book, and that
the code be freely available for modification by any
user. This raised many copyright and other legal
issues that have not yet been resolved.
7. Legal Ownership
Legal ownership of components is concerned with
three types of legal claims: copyright, patents, and trade
secrets [11]. A copyright protects the expression of an
idea. Copyright has traditionally been used to protect
books and other print material, and music. Current
copyright law allows copying of software for backup and
archival purposes. Copyright protection is relatively
inexpensive and easy to obtain. Copyright claims need
not be formally filed, though failure to do so may limit
legal claims.
There has been some work on assuring versions of
software using encryption methods [13]. In this scheme,
each component would be assigned a unique identifier.
Once published, the component could not be changed
even by the author without changing the identifier. This
method might also be used to protect copyrighted
software components. Collberg and Thomborson describe
a method called watermarking for embedding a secret tag
in a component that can be used to uniquely identify the
component, and therefore to tell if it has been stolen [4].
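The idea that a published component cannot change without its identifier changing can be illustrated by deriving the identifier from the component's bytes. The sketch below is a stand-in, not the scheme of [13]: it uses the 32-bit FNV-1a hash, which is not cryptographic and so detects accidental edits but offers no protection against deliberate tampering. The name `component_id` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Derives a 32-bit identifier from the bytes of a component using the
 * FNV-1a hash (offset basis 2166136261, prime 16777619). Assumption: a
 * real integrity scheme would use a cryptographic digest or signature. */
uint32_t component_id(const unsigned char *data, size_t len)
{
    uint32_t hash = 2166136261u;

    assert(data != NULL || len == 0);
    for (size_t i = 0; i < len; i++) {
        hash ^= data[i];        /* mix in the next byte */
        hash *= 16777619u;      /* multiply by the FNV prime (mod 2^32) */
    }
    return hash;
}
```

A user who records the identifier published with a component can recompute it over a downloaded copy and compare the two values to see whether the copy is the same version.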
A patent protects an idea, rather than the expression of
the idea. Current patent law restricts others from using the
patented idea for seventeen years after the patent is
granted. Software, algorithms, and processes are typically
patented rather than copyrighted. Obtaining a patent is a
long, expensive process, involving an extensive search to
determine if the patent is original. Patents are granted by
government agencies such as the U.S. patent office. Over
20,000 software patents were issued from 1994-96 [1].
Twenty-nine of ninety-two respondents to a survey on
software reuse agreed at least somewhat that they
were inhibited from reusing software by legal issues [7].
Legal issues, unfortunately, are likely to grow in
importance as reuse crosses organizational boundaries and
moves into the open marketplace. Our experience with
GNU and with the user who wanted a legal document
giving him the right to use the IR components (see usage
section below) reinforces this point.
8. Maintenance and Configuration Management
Perhaps the most difficult problems about the
components concern maintenance and configuration
management. Maintenance is expensive. Maintenance
costs can easily exceed half of the total costs for a
software project, and numbers for reusable component
collections are probably similar. Code contributors
usually do not want to be responsible for maintenance, so
component collections like the IR components usually do
not have adequate maintenance support. In this section,
the main issues of maintenance are briefly reviewed.
Software configuration management is about how to
monitor and control changes to software, in this case
reusable software assets. Versions of reusable assets must
also be coordinated with other software lifecycle items to
produce correct and consistent product releases.
Configuration management has three major activities:
• Version control. Reusable software components, like
any software product, will have versions because of error
fixes and enhancements. To build a system using these
assets, one will need to know which version to use. Old
versions of assets must be recoverable for reference, and
so they can be used to make corrections and
enhancements. As software assets change, they form
successive versions. Version control is the activity of
keeping track of these versions. To handle this problem,
the IR components were put under change control using
SCCS (Source Code Control System). Since the code
appears in various places (ftp site, CD-ROM, and various
other web sites), keeping these versions current and coordinated
is a very hard problem. One solution to knowing for sure
which version of a component you have is to use
encryption techniques on the component [13].
• Change control. Change control is the procedure for
requesting changes, deciding what changes to make,
making changes, and recording and verifying changes.
Changes to reusable assets in a library should not be
made haphazardly, but must be made under a controlled
process, though this is often not the case. Change
requests for the IR code generally come via email. I put
reports of known bugs at the ftp site, but reports of
the same bug keep coming in, in part because the code
appears in so many places.
• Build control. Keeping track of which versions of
work products go together to form a release, and
generating derived assets and systems correctly, is called
build control. Build control for reuse has two aspects.
One is the general specification of which versions of
assets to use in a system build. The other aspect is that
reusable assets may themselves be composites of other
items, so specifications of how to build assets may also
be required. Build control for the IR components is
handled with Make.
9. Searching and Understanding
Much early work on reuse focused on the building of
reuse libraries and methods for indexing components and
searching for them. Many researchers began to feel that
this aspect of reuse was sufficiently understood, and that
too much attention was given to it. The focus of reuse
research moved to design of reusable components, domain
analysis, and so on.
The internet is probably the main source now
consulted by software engineers looking for reusable
software outside their own development environment. The
main types of indexing used on the web are free text
keyword searching, and to a lesser degree enumerated
classifications. Searching on the web is made difficult by
the size and dynamics of the database, and by the fact that
different search engines will find different web pages
given the same query.
In teaching reuse courses to graduate students at
Virginia Tech, I found that they had difficulty finding
existing components on the web. For example, in one
course students needed to find stemmers on the web. I
had searched myself and knew that several different ones
could be found. Typical of their input was the following
email I received from the student who eventually received
the highest grade in the class.
"I'm still a little confused about what we should produce
for the code analysis part of the project. I know we will
try to come up with a generic architecture by looking for
similarities in the code. I think this will be hard,
considering the fact that I have only found code for one
algorithm (Porter). Are we supposed to compare different
implementations of the same algorithm?"
I found in working with the students that they did not
know how to formulate good search queries.
Another problem is helping users understand reusable
software components. This is important because if
software engineers cannot understand components, they
will not be able to reuse them. Current methods for
representing reusable components are inadequate. A study
of four common representation methods for reusable
software components showed that none of the methods
worked very well for helping users understand the
components [8].
We are currently doing research on visualization
techniques, such as hypertrees, hierarchical trees, and
tables, for helping users understand reusable software
components [2]. We are using the IR components as a
testbed for this research. Our visualizations are grounded
in reuse design principles, such as the 3 C's model, and
in general principles of information design such as those
of Tufte. We use an extension of XML as a modeling
language for the components.
10. Usage
Because of the different venues used to distribute the
IR components, usage data and user feedback comes from
various sources. One source is email from users typically
asking where they can find copies of the code, reporting a
bug in the code, or occasionally asking if the code can be
included in a commercial application such as the
following one received recently.
"What is the status of your stemming code
(implementation of the Porter algorithm) located in
ftp://sunsite.dcc.uchile.cl/pub/users/rbaeza/irbook/stemmer/,
is it public domain or copyrighted? The reason that
I ask is I want to know if it is okay to use it in a
search engine I am creating for my commercial
website."
I typically pass these requests on to the editor of the
book at Prentice-Hall who approves them and asks that
the source of the component be referenced in the code and
documentation of the system in which it will be used.
This time I got a follow-up message:
"The email address I used was from the read.me file
that came with the stemming code - the address is
frakes@sarvis.cs.vt.edu. My lawyer wants me to have
sign something confirming the info below - is your
address at Virginia Polytechnic Institute and
State University still valid?"
This message points to two problems: how to keep
information associated with the code, in this case my
email address, up to date, and how to handle legal
problems. I sent the message on to the editor at
Prentice-Hall.
The proliferation of the code on various websites is
also an indicator of usage, as are references to the code in
various web pages. Some of the web pages are papers that
reference the book or code from the book, some are
syllabi for courses, others contain variants of some of the
components written in different languages. Another source
of feedback from users can be found in reviews of the
book at websites like amazon.com.
11. Current Status and A Proposal
I am currently working with available personnel (i.e. a
graduate student) to address some of the problems
identified above. Specifically the student is doing a
semester project to:
• place the code, which now has two versions, under
change and version control using RCS;
• place the code on at least two ftp servers;
• convert the code to the GNU coding and “free
software” standards;
• create a web page for the code that provides
information and pointers to the distribution sites;
• create documentation that will allow continuity
in the maintenance of the software.
Experience with the IR code collection shows that
current methods of development, maintenance, and
distribution work, but need improvement. Some
recommendations follow.
There is much inefficiency in the development of
components that accompany texts. For example, there are
many books that provide code that implements the basic
data structures and algorithms of computer science such as
sorting, searching, lists, stacks, queues and so on. A
standard way of cataloging these data structures and
algorithms could be quite helpful. For example, each
unique algorithm or data structure specification might be
assigned a product number similar to an ISBN for
a book. Implementations of these specifications might
also be assigned a number that references the number of
the implemented specification. Such components might
also include information on quality assurance, indexing
terms, repository locations, and so on.
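One way to picture such a catalog is as two record types, one for specifications and one for implementations that refer back to them. The sketch below is purely illustrative: the field names, the numbering, and the helper functions are assumptions, not a proposed standard.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical catalog entry for an algorithm or data structure
 * specification, identified by a unique ISBN-like number. */
struct spec_record {
    long spec_number;
    const char *name;
};

/* Hypothetical catalog entry for one implementation of a specification. */
struct impl_record {
    long impl_number;          /* unique number of this implementation */
    long spec_number;          /* number of the implemented specification */
    const char *language;      /* implementation language */
    const char *repository;    /* where the code can be obtained */
    const char *index_terms;   /* terms for searching the catalog */
    int branch_coverage_pct;   /* quality assurance data, 0 if unknown */
};

/* Counts the catalogued implementations of a given specification. */
size_t count_implementations(const struct impl_record *impls, size_t n,
                             long spec_number)
{
    size_t count = 0;

    assert(impls != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (impls[i].spec_number == spec_number) count++;
    }
    return count;
}

/* Returns the first implementation of a specification in the requested
 * language, or NULL if the catalog holds none. */
const struct impl_record *find_implementation(const struct impl_record *impls,
                                              size_t n, long spec_number,
                                              const char *language)
{
    assert(language != NULL);
    for (size_t i = 0; i < n; i++) {
        if (impls[i].spec_number == spec_number &&
            strcmp(impls[i].language, language) == 0) {
            return &impls[i];
        }
    }
    return NULL;
}
```

With records like these, the several language variants of a single algorithm, such as the stemmer versions mentioned earlier, would all point to the same specification number.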
This will only happen if it makes legal and financial
sense, and the legal and financial issues are far from
solved. The case of GNU, for example, shows the
complexity of issues related to “free software”. The recent
trend towards patenting software algorithms also adds to
the difficulty of freely sharing and reusing software.
There is also the continuing question of who will provide
resources for long term maintenance tasks. These
important problems must be solved if we are to make
better use of existing reusable software sources.
References
[1] Aharonian, G. (1995). 1995 US Patent Statistics.
http://www.baker.com/grandunificationtheory/archive/199601/19960121.html
[2] Alonso, O., & Frakes, W. B. (2000). Visualization of
Reusable Software Assets. In W. B. Frakes (Ed.), ICSR6: Sixth
International Conference on Software Reuse. Vienna,
Austria: Springer-Verlag.
[3] Bentley, J. (1982). Writing Efficient Programs.
Englewood Cliffs, NJ: Prentice-Hall.
[4] Collberg, C., & Thomborson, C. (1999). Software
watermarking: Models and dynamic embeddings. In
POPL’99, 26th Annual SIGPLAN–SIGACT Symposium on
Principles of Programming Languages, (pp. 311–324).
[5] Dr. Dobb’s Essential Books on Algorithms and Data
Structures. (1999).
[6] Frakes, W., & Baeza-Yates, R. (Eds.). (1992). Information
Retrieval: Data Structures and Algorithms. Englewood Cliffs,
N.J.: Prentice-Hall.
[7] Frakes, W. B., & Fox, C. J. (1995). Sixteen
Questions about Software Reuse. CACM, 38(6), 75-87.
[8] Frakes, W., & Pole, T. (1994). An Empirical Study of
Representation Methods for Reusable Software Components.
IEEE Transactions on Software Engineering, 20(8),
617-630.
[9] Frakes, W., & Terry, C. (1996). Software Reuse and
Reusability Models and Metrics. ACM Computing Surveys,
28(2), 415-435.
[10] Free Software Foundation. (1998). GNU Coding
Standards.
[11] Huber, T. (1994). Reducing Business and Legal Risks in
Software Reuse Libraries. In ICSR-3. Rio de Janeiro:
IEEE CS Press.
[12] Latour, L., Wheeler, T., & Frakes, B. (1991). Descriptive
and Prescriptive Aspects of the 3 C's Model: SETA1 Working
Group Summary. Ada Letters, XI(3), 9-17.
[13] Moore, J. W. (1994). The Use of Encryption to Ensure the
Integrity of Reusable Software Components. Third
International Conference on Software Reuse, (pp. 118-125).
Rio de Janeiro: IEEE CS Press.
[14] Stroustrup, B. (1996). Language-technical Aspects of
Reuse. In Fourth International Conference on Software Reuse,
(pp. 11-19). Orlando, FL: IEEE CS Press.