Designing and Developing an Automatic Interactive Keyphrase Extraction
System with Unified Modeling Language (UML)
College of Information Science & Technology, Drexel University, Philadelphia, PA 19104
(215) 895-2474, 01
Il-Yeol Song, Xiaohua Hu
Abstract
Designing and developing a system that assists users in digesting and understanding available information has been a difficult challenge. In this paper, we discuss the design and development of an automatic interactive keyphrase extraction system, called KPSpotter, which is capable of processing various formats of data such as XML, HTML, and plain text through the Internet. KPSpotter combines the Information Gain data mining measure and several Natural Language Processing (NLP) techniques, such as the Part of Speech (POS) technique and First Occurrence of Term. To improve extraction accuracy, WordNet is incorporated into KPSpotter. In designing and developing KPSpotter we utilized Unified Modeling Language (UML). UML modeling helps in the formalization of the preliminary analysis model and accomplishes iterative system design and development. We also conducted experiments for system performance testing by comparing keyphrases extracted by KPSpotter and KEA, a well-known naive Bayesian-based keyphrase extraction system. The experiments show that KPSpotter outperforms KEA in most test cases.

Introduction
Digesting information available through the Internet has become a serious issue. There have been rigorous attempts to tackle this issue of information overload in fields such as topic detection, text summarization, and keyphrase extraction.

In this paper, we present KPSpotter, an automatic keyphrase extraction system that employs a novel extraction technique. Our technique combines the Information Gain data mining measure and several Natural Language Processing (NLP) techniques such as the Part of Speech (POS) tagger, Term Frequency*Inverse Document Frequency (TF*IDF), and Distance from First Occurrence (DIS). Information Gain is a well-known data mining technique introduced in the ID3 algorithm (Quinlan, 1993). In applying POS techniques to KPSpotter, we combine several POS tagging techniques, namely 1) NLPParser (Charniak, 2000), 2) Link-Grammar (Lafferty et al., 1992), 3) PCKimmo (Antworth, 1993), and 4) Brill’s Tagger (Brill, 1993), to improve POS tagging accuracy. This combined approach to POS techniques enables us to assign the best POS tagging to lexical tokens, constituting candidate phrases by utilizing outstanding features of each POS technique.

It is a challenging task to design and implement a keyphrase extraction system that requires such components as a POS library, WordNet, and text processing tools. The main objective of the paper is to discuss our design and implementation of KPSpotter, whose goals are to be 1) flexible in terms of processing various input data formats, 2) accessible through the Internet, and 3) robust in terms of ex-

In order to improve accuracy, we incorporate into KPSpotter WordNet’s capability of converting the verb form of a term to its noun form (WordNet 2.0). In our previous study, we found that the set of keyphrases assigned to a research paper by its author often contains noun phrases that do not actually appear in the text; instead, the verb form of the noun appears in the text (Song et al., 2003). Incorporating WordNet into KPSpotter improves the accuracy of extracting keyphrases.

In addition to the proposed novel extraction technique, KPSpotter is differentiated from other extraction systems in
various aspects. First, with an object-oriented system architecture perspective, KPSpotter is developed to be a flexible keyphrase extraction system that handles various types of file formats such as HTML, XML, and ASCII, whereas other keyphrase extraction systems such as KEA (Frank et al., 1999) and GenEx (Turney, 2000) require the input data to be in certain formats. In particular, UML is used to design and develop KPSpotter to embrace a variety of data algorithms and NLP techniques. Second, KPSpotter is an interactive keyphrase system capable of extracting keyphrases through a web interface. These strengths of KPSpotter make the system portable and flexible in the situation in which various data formats and system environments exist in the digi-

The effectiveness of KPSpotter was evaluated by comparing the keyphrases extracted by KPSpotter with the ones that the authors assigned. We then compare KPSpotter with KEA, a well-known naive Bayesian-based keyphrase extraction system. The preliminary experiments show that KPSpotter outperforms KEA in most instances.

To demonstrate the flexibility of KPSpotter in handling various types of input format, we extract keyphrases from Web data such as HTML pages. From the set of candidate keyphrases extracted by KPSpotter, the user can weigh the candidate keyphrases and provide feedback to the system. With this user’s feedback, KPSpotter adjusts its weighting scheme to extract keyphrases.

KPSpotter can be applied to several document management areas. First, it can serve as an extraction engine for a full-text document clustering system. A goal of the document clustering system is to cluster the retrieved documents on the fly, and it is practically impossible to cluster the full-text documents due to the size of the document-term matrix that the system needs to process. KPSpotter extracts keyphrases for the given full-text documents at indexing time, and then the clustering system takes keyphrases instead of full-text documents as input and clusters them. Another useful application area is information visualization. A critical issue addressed in information visualization is labeling (Song, 2000). Many visualization systems use single terms for labeling, and a single term often obscures the meaning of the visual objects that each label intends to represent. KPSpotter can serve as a better labeling engine for the information visualization system by supplying meaningful keyphrases.

The remainder of this paper is as follows: Section 2 describes the system architecture and development details, and Section 3 explains the details of data processing and feature selection procedures. Section 4 reports the results of the experiments. Section 5 discusses lessons we learned during the system design and development. Finally, Section 6 concludes the paper.

System Design
In this section, we describe how KPSpotter is architected. In addition, we illustrate the web interface of KPSpotter and explain how to use it. Throughout the development cycle of the system, UML was used to embed object-orientation in the system. The UML diagrams we developed include use case, class, and activity diagrams.

As illustrated in Figure 1, KPSpotter comprises the following two stages: 1) building extraction model and 2) extracting keyphrases. The input of the “building extraction model” stage is training data, and the input of the “extracting keyphrases” stage is test data or production data. Both training and test data are processed by the three components: 1) Data Cleaning, 2) Data Tokenizing, and 3) Data Discretizing. In Figure 1, the dotted line represents the processing logic for “building extraction model” whereas the solid line indicates the processing logic for “extracting keyphrases.” The detailed descriptions are provided in the following subsections. These two stages are fully automated. Depending on the configuration parameters, KPSpotter runs in either “building extraction model” mode or “extracting keyphrases” mode. The outcomes of both processes are stored in XML form.

[Figure 1 diagram. Labels: Training data, Test data, Data Cleaning, Data Tokenizing, Data Discretizing, Dropping special characters, Case-folding, Stemming, TF*IDF, Distance from first occurrence (DIS), Token DB, WordNet DB, Model XML DB, Keyphrase XML DB, document summarization]
Figure 1: System architecture of KPSpotter
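As a minimal sketch of the data flow in Figure 1, the two numeric features, TF*IDF and distance from first occurrence (DIS), could be computed as shown below. The sketch assumes a simple regular-expression cleaner, single-word candidates, and equal-width discretization, none of which are specified in this paper; the function names clean, tokenize, tf_idf, first_occurrence, and discretize are hypothetical and are not part of KPSpotter.

```python
import math
import re


def clean(text):
    """Data cleaning: case-fold and drop special characters (assumed rules)."""
    return re.sub(r"[^a-z0-9\s]", " ", text.lower())


def tokenize(text):
    """Data tokenizing: split the cleaned text on whitespace."""
    return clean(text).split()


def tf_idf(term, doc_tokens, corpus_tokens):
    """TF*IDF: relative term frequency times log inverse document frequency."""
    tf = doc_tokens.count(term) / len(doc_tokens)
    df = sum(1 for tokens in corpus_tokens if term in tokens)
    return tf * math.log(len(corpus_tokens) / df) if df else 0.0


def first_occurrence(term, doc_tokens):
    """DIS: position of the first occurrence as a fraction of document length."""
    return doc_tokens.index(term) / len(doc_tokens)


def discretize(value, low, high, bins=3):
    """Data discretizing: equal-width binning into `bins` intervals (an assumption)."""
    if high == low:
        return 0
    index = int((value - low) / (high - low) * bins)
    return min(max(index, 0), bins - 1)


# Toy corpus (hypothetical data) to show how the pieces fit together.
corpus = [tokenize(d) for d in (
    "Keyphrase extraction with information gain.",
    "Document clustering uses a document-term matrix.",
    "Information visualization needs meaningful labels.",
)]
doc = corpus[0]
features = {t: (tf_idf(t, doc, corpus), first_occurrence(t, doc)) for t in set(doc)}
```

Each candidate ends up with a pair of continuous values, which a discretization step would then map to a small number of bins before the information gain measure is applied.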
Use Case Analysis
An important UML model that provides useful knowledge about the usage of a system is the use case diagram. Use case diagrams document the functionality of a system and the users of the system.

Figure 2: Use Case Diagram of KPSpotter

As illustrated in Figure 2, an actor is shown as an agent who interacts with the system agent. This use case diagram shows KPSpotter consisting largely of three components: 1) train model, 2) extract phrase, and 3) apply information gain measure for extraction.

In this section, we present the structure of KPSpotter at the class diagram level. Class diagrams provide a static representation of the structure of a system. Class diagrams appear in various levels of detail depending on the phase of the lifecycle (Fowler, 2003). Figure 3 depicts a high-level conceptual class diagram of KPSpotter.

Figure 3: Class diagram of KPSpotter

The following five major components are shown in the class diagram: 1) ModelBuilder, 2) DBHandler, 3) POSHandler, 4) ModelManager, and 5) KeyphraseHandler. The ModelBuilder component consists of classes that process various input formats such as HTML and XML. The DBHandler component stores statistics on candidate phrases and input documents. The POSHandler component interfaces with the four POS tagging libraries implemented in KPSpotter. The ModelManager component applies discretization and WordNet’s conversion of verb forms to noun forms. The KeyphraseHandler component takes care of extracting keyphrases based on the information gain data mining measure.

To help understand how the system works, an activity diagram is provided (Figure 4). Activity diagrams represent the business and operational workflows of a system and show the activity and the event that causes the object to be in a particular state (Hofmeister, 1999).
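The information gain measure on which the KeyphraseHandler component relies can be illustrated with a small sketch. It assumes that each candidate phrase has already been reduced to a discretized feature value and a label saying whether it is an author-assigned keyphrase. The function names and the toy data are hypothetical, and the formula is the standard entropy-based definition rather than code taken from KPSpotter.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(feature_values, labels):
    """Entropy of the labels minus the expected entropy after splitting on the feature."""
    total = len(labels)
    remainder = 0.0
    for value in set(feature_values):
        subset = [lab for f, lab in zip(feature_values, labels) if f == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder


# Toy training data (hypothetical): a discretized TF*IDF bin per candidate and
# whether the candidate was an author-assigned keyphrase.
tfidf_bins = ["high", "high", "low", "low"]
is_keyphrase = [True, True, False, False]
gain = information_gain(tfidf_bins, is_keyphrase)  # 1.0 bit: the bins separate the classes
```

A feature whose bins are unrelated to the keyphrase label yields a gain of zero, so ranking features or candidates by this quantity favors those that best separate keyphrases from non-keyphrases.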
Figure 4. Activity diagram of KPSpotter

As illustrated in Figure 4, depending on the process mode of the system, KPSpotter handles test data or training data. Consequently, it either generates a trained model or extracts keyphrases. Which path KPSpotter takes is determined by the configuration settings in the form of XML.

Web Interface of KPSpotter
In this section, we describe the web interface of KPSpotter. KPSpotter provides a web interface for the user to access through the Internet (Figure 5). For the process mode, the user can select either “train model” or “extract keyphrases.” In the current state of the system, there are three options available for providing input data. The first option is for the input data to be accessible by the HTTP protocol: given a URL that the user provides, KPSpotter fetches and processes the data. The second option is that the user can directly put the input data into the textbox. The last option makes it possible for the user to upload the input data.

Figure 5: Web interface of KPSpotter

The output of executing KPSpotter, a list of keyphrases, is displayed in the browser in XML form (Figure 6). In Figure 6, for the record id 10004, a total of fifteen keyphrases are extracted and each keyphrase is weighted with the information gain data mining measure.

Figure 6: Sample Keyphrases extracted by KPSpotter

Evaluation
In this section, we report the preliminary experimental results of the performance of KPSpotter.

We measured the performance of KPSpotter by comparing its keyphrases with human-generated keyphrases. Turney (2000) reports that, in his experiment data, an average of about 75% of the human-generated keyphrases appear in the body of the corresponding document. With these findings, he argued that an ideal keyphrase extraction algorithm could generate phrases that match up to 75% of the author’s keyphrases.

Taking this result into consideration, optimistically speaking, KPSpotter needs to extract three to four keyphrases that match the list of keyphrases that the authors provided. The overall performance of KPSpotter is shown in Table 1 and illustrated further in Figure 7. For the given test documents, KPSpotter extracted more than two “correct keyphrases” on average. By integrating with WordNet, we gain a significant accuracy improvement compared to our previous experiments (Song et al., 2003).

Keyphrase   No. of keyphrases   Keyphrases without   Keyphrases with
range       extracted           WordNet              WordNet
5           96                  0.90327              1.30327
10          115                 1.334                1.716
15          124                 1.524                2.179
20          130                 1.728                2.3556
Table 1. Overall quality of KPSpotter by accuracy

In Figure 7, the first line from the top is the average number of keyphrases that the authors assigned. The second line from the top shows the number of keyphrases that appear in the documents. The third one indicates the average number of correct identifications.

[Figure 7 plot. x-axis: Number of keyphrases (0 to 25); y-axis: Number of correct; legend fragments: "appearing in abstract text", "assigned by"]
Figure 7. Overall Performances
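The average number of “correct keyphrases” used in this evaluation can be read as a per-document match count averaged over the test set. The following sketch assumes that a match is an exact string match after case-folding and whitespace normalization; the matching rule actually used by KPSpotter is not stated here, and the function names and sample data are hypothetical.

```python
def normalize(phrase):
    """Case-fold and collapse whitespace so trivially different spellings match."""
    return " ".join(phrase.lower().split())


def correct_count(extracted, author_assigned, top_n):
    """Number of the top_n extracted keyphrases that the authors also assigned."""
    gold = {normalize(p) for p in author_assigned}
    return sum(1 for p in extracted[:top_n] if normalize(p) in gold)


def average_correct(documents, top_n):
    """Average correct_count over (extracted, author_assigned) pairs."""
    return sum(correct_count(e, a, top_n) for e, a in documents) / len(documents)


# Hypothetical test set: ranked system output paired with author keyphrases.
documents = [
    (["relevance feedback", "image retrieval", "Multimedia  information retrieval"],
     ["Multimedia Information Retrieval", "Relevance Feedback"]),
    (["keyphrase extraction", "web interface"],
     ["Information Gain", "Keyphrase Extraction"]),
]
```

Calling average_correct(documents, top_n) for increasing values of top_n would give one value per keyphrase range, in the same spirit as the rows of Table 1.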
A similar result is reported for KEA (Witten et al., 1999). KEA generates about one to two “correct keyphrases.” As illustrated in Figure 8, KPSpotter outperforms KEA in the first four cases (for the last case, the result from KEA was not available). The results from the experiments seem to indicate that KPSpotter produces acceptable performance in terms of the average number of matches.

[Figure 8 plot. Axis label: No of keyphrases (5 to 25); series: KEA, KPSpotter]
Figure 8. Performance Comparison

In order to demonstrate the extensibility and flexibility of KPSpotter, we extracted keyphrases from publication-related web data available in the IEEE digital library (IEEE digital library). We chose the IEEE digital library because its publication-related Web pages contain not only reasonably sized abstracts, but also keyphrases generated by the authors. We obtained 150 publication-related web pages for training data and 50 web pages for testing data. KPSpotter then parses the HTML pages and stores author-generated keywords and abstracts separately. Figure 9 shows the sample keyphrases extracted from the publication web page of the article titled “A Relevance Feedback Architecture for Content-based Multimedia Information Retrieval Systems.” The list of keyphrases given by the authors of the article includes: 1) Multimedia Information Retrieval, 2) Relevance Feedback, and 3) Content-based Image/Video Retrieval. It should be noted that although the size of training data is small (20 Web sites), KPSpotter is able to match one or two keyphrases out of five keyphrases generated by the authors.

Lesson Learned
In this section, we summarize the lessons we learned in developing KPSpotter with object-oriented technologies.

Third party software dependency: We used three POS libraries to identify the word sense. Some major issues with memory leaks and performance were raised due to bugs in the third party library. Since the communication channel with the third party company wasn't established in an efficient manner, it took a while to fix the problems. We felt that it is critical to establish a solid communication channel with the third party software developers early in the development phase.

System design with UML: After the requirements gathering was finished, there was not sufficient time to develop a fairly mature set of analysis specifications due to the tight development schedule. However, the core diagrams in UML such as use case, class, and sequence diagrams improved the design team members' understanding of the project in a timely manner and also helped to develop quality software (Hofmeister, 1999). Since the requirements were continuously changing, we had to modify the diagrams through a UML tool to reflect the changes of requirement specifications in the design. However, due to the tight development schedule, we bypassed this update process and directly changed the code instead. This caused confusion among the developers when discussing the code changes, and it was especially confusing for the developers who joined the project at a later stage. Throughout the development of the system, we realized that it is crucial to update the UML diagrams and reflect the changes of requirement specifications in the design prior to the code changes.

Handling special characters in XML entity: Since our XML-formatted web database contains data written in
English as well as data in other languages, we had to cope
with special characters such as ë or ä. The XML parser we
used abruptly terminated its execution when it processed
those special characters. To work around this problem, we
took an ad hoc approach by replacing those characters in
raw data with corresponding encoded characters. This is a
well-known issue with XML parsers in handling some
foreign characters in XML entities. For
internationalization, handling of special characters needs to
be addressed in the XML parser enhancement.
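The workaround described above, replacing special characters in the raw data with encoded equivalents, can be sketched as follows. This assumes the replacement targets are XML numeric character references (for example, &#235; for ë), which is one common encoding and not necessarily the one used in KPSpotter; the function name encode_non_ascii and the sample record are hypothetical.

```python
import xml.etree.ElementTree as ET


def encode_non_ascii(raw):
    """Replace every non-ASCII character with an XML numeric character reference."""
    return raw.encode("ascii", errors="xmlcharrefreplace").decode("ascii")


raw_record = "<author>Schütze, Noël</author>"
safe_record = encode_non_ascii(raw_record)
# A standards-compliant parser resolves the references back to the original characters.
author = ET.fromstring(safe_record).text
```

Because the encoded record is pure ASCII, it sidesteps parser failures that stem from mismatched or undeclared character encodings, while the parsed text still contains the original characters.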
Lack of communication among the development team: We realized that an effective communication channel between development team members must be established. Several developers wrote different pieces of the C++ classes based on common class libraries (STL) simultaneously. As a result, we experienced some inconsistency and redundancy in the writing of the program. In addition, the project suffered from ineffective communication among the internal clients, project managers, and developers.

Figure 9. A sample of keyphrases extracted from medical data

Conclusion
In this paper, a flexible automatic keyphrase extraction system, called KPSpotter, is proposed. KPSpotter employs a new technique combining the Information Gain data mining measure and several Natural Language Processing techniques such as stemming and case-folding. The three features identified by KPSpotter for candidate keyphrases are 1) TF*IDF, 2) distance from the first occurrence of the phrase, and 3) POS tagging.

KPSpotter was designed and developed in the spirit of object-oriented design and analysis. In particular, in order to help understand the system architecture in an effective way, UML notions and diagrams were employed. We also reported the lessons learned from designing and developing KPSpotter with UML.

KPSpotter is characterized and differentiated from other keyphrase extraction systems by the following: 1) it introduces an extraction technique combining Information Gain and Natural Language Processing techniques; 2) it provides a web interface for the user to obtain a list of keyphrases for the supplied input data; 3) it processes various types of input data such as XML, HTML, and unstructured text data and generates XML output; 4) it stores statistical information of candidate phrases in BerkeleyDB, a persistent object storage device; 5) it also stores both the model and the list of keyphrases for the target document in an XML file; and 6) WordNet is incorporated into the system to improve extraction accuracy. These features of KPSpotter make it suitable for real-world applications where robustness, flexibility, and speed are important.

To evaluate the performance of the system, we conducted a series of experiments and reported the experimental results. KPSpotter outperforms KEA in the cases of extracting 5 to 15 keyphrases and also demonstrates equivalent extraction quality to KEA in extracting 20 keyphrases in terms of the number of matches between system-generated and human-generated keyphrases. The number of correct keyphrases out of 25 keyphrases extracted by KPSpotter is more than two on average. In addition, the results of extracting keyphrases from publication-related web sites in the IEEE digital library indicated that KPSpotter is capable of extracting meaningful sets of keyphrases. These findings are encouraging because KPSpotter is able to extract one or two matched keyphrases even though the size of training data was small, 50 web sites.

KPSpotter can serve as an extraction engine in the following document management areas: 1) a full-text document clustering system, which benefits from KPSpotter by clustering documents with keyphrases and 2) an information visualization system, which utilizes KPSpotter for generating meaningful labels for visual objects.

We are conducting experiments that compare the performance of different data mining techniques, such as Information Gain, Support Vector Machine (SVM), and K-Nearest Neighbor, in predicting keyphrases. These mining algorithms have been successfully applied to document classification tasks.

Regarding the methodology of the experiments, we are also undertaking a study on robust and sophisticated accuracy measures for the usability of the system. The emphasis of the follow-up study is on measuring the usefulness of keyphrases to the users of digital libraries.

References
Antworth, Evan L. (1993) Glossing text with the PC-KIMMO morphological parser. Computers and the Humanities. pp. 475-484.

Brill, Eric (1993) Automatic Grammar Induction and Parsing Free Text: A Transformation-Based Approach. In: Proceedings of ACL, 259-265.

Caropreso, F.M., Matwin, S. and Sebastiani, F. (2001) A learner-independent evaluation of the usefulness of statistical phrases for automated text categorization. In: Amita G. Chin (ed.), Text Databases and Document Management: Theory and Practice, Idea Group Publishing, Hershey, US, pp. 78-102.

Charniak E. (2000) A Maximum-Entropy-Inspired Parser. In: Proceedings of NAACL-2000.

Dougherty, J., Kohavi, R. and Sahami, M. (1995) Supervised and unsupervised discretization of continuous features. In: Proceedings of ICML-95, 12th International Conference on Machine Learning, Lake Tahoe, US, pp. 194-202.

Fowler M. (2003) UML Distilled: A Brief Guide to the Standard Object Modeling Language. Addison-Wesley.

Frank E., Paynter G.W., Witten I.H., Gutwin C. and Nevill-Manning C.G. (1999) Domain-specific keyphrase extraction. In: Proc. Sixteenth International Joint Conference on Artificial Intelligence, Morgan Kaufmann Publishers, San Francisco, CA, pp. 668-673.

Hofmeister C., Nord R.L., Soni D. (1999) Describing Software Architecture with UML. In: 1st Working IFIP Conference on Software Architecture (WICSA1), Feb 22-24, pp. 145-159.

Lafferty J., Sleator D., and Temperley D. (1992) Grammatical Trigrams: A Probabilistic Model of Link Grammar. In: Proceedings of the AAAI Conference on Probabilistic Approaches to Natural Language, October.
Manning C. and Schütze H. (1999) Foundations of Statistical Natural Language Processing, MIT Press. Cambridge,

Porter, M.F. (1980) An algorithm for suffix stripping, Program, 14(3), pp. 130-137.

Quinlan, J. R. (1993) Programs for Machine Learning, San Mateo: Morgan Kaufmann Publishers.

Radev D.R., Qi H., Zheng, Z., Blair-Goldensohn S., Zhang Z., Fan W., and Prager J. (2001) Mining the web for answers to natural language questions. In: ACM CIKM 2001: Tenth International Conference on Information and Knowledge Management, Atlanta,

Song, M. (2000) Visualization in information retrieval: a three-level analysis, Journal of Information Science, 26 (1): 3-19.

Song, M, Song, I.Y., and Hu, T. (2003) KPSpotter: A Flexible Information Gain-based Keyphrase Extraction System, Fifth International Workshop on Web Information and Data Management (WIDM'03), In Conjunction with the 12th International Conference on Information and Knowledge Management (CIKM 2003), November 7-8, 2003.

Turney, P.D. (2000) Learning algorithms for key phrase extraction. Information Retrieval, 2, pp.

Witten I.H., Paynter G.W., Frank E., Gutwin C. and Nevill-Manning C.G. (1999) KEA: Practical automatic keyphrase extraction. In: Proc. DL '99, pp. 254-256.

IEEE Digital Library,