Requirement Analysis Version 0.4
by the Stat Team
The Stat Project, guided by
Professor Eric Nyberg and Anthony Tomasic
Feb. 25, 2009
Introduction to Stat
In this chapter, we give a brief introduction to the Stat project for readers of this document.
We explain the background, the motivation, the scope, and the stakeholders of this project so that
readers can understand why we are doing it, what we are going to do, and who may be interested
in our project.
Stat is an open source machine learning framework in Java for text analysis, with a focus on
semi-supervised learning algorithms. Its main goal is to facilitate common textual data analysis tasks
for researchers and engineers, so that they can get their work done straightforwardly and efficiently.
Applying machine learning approaches to extract information and uncover patterns from textual
data has become extremely popular in recent years. Accordingly, many software packages have been
developed to enable people to use machine learning for text analytics and to automate such
processes. Users, however, find many of these existing packages difficult to use, even if they just want
to carry out a simple experiment; they have to spend much time learning those packages and may
finally find out they still need to write their own programs to preprocess data to get their target
We have noticed this situation and observe that many of these can be simplified. A new software
framework should be developed to ease the process of doing text analytics; we believe researchers
or engineers using our framework for textual data analysis would find the process convenient,
comfortable, and possibly enjoyable.
Existing software for applying machine learning to linguistic analysis has tremendously
helped researchers and engineers make new discoveries based on textual data, which is unarguably
one of the most common forms of data in the real world.
As a result, many more researchers, engineers, and possibly students are increasingly interested
in using machine learning approaches in their text analytics. However, the bar for entering this
area is not low. These people, some of whom are experienced users, find that existing software
packages are generally not easy to learn or convenient to use.
For example, although Weka has a comprehensive suite of machine learning algorithms, it is
not designed for text analysis and lacks native support for representing and processing linguistic
concepts. MinorThird, on the other hand, though designed specifically as a
package for text analysis, turns out to be rather complicated and difficult to learn. It also does not
support semi-supervised and unsupervised learning, which are becoming increasingly important
machine learning approaches.
Another problem with many existing packages is that they often adopt their own specific input
and output formats. Real-world textual data, however, generally come in other formats that are not
readily understood by those packages. Researchers and engineers who want to make use of those
packages often find themselves spending much time seeking or writing ad hoc format conversion
code. This ad hoc code, which could have been reusable, is often written over and over again
by different users.
Researchers and engineers, when presented with common text analysis tasks, usually want a
text-specific, lightweight, reusable, understandable, and easy-to-learn package that helps them get their
work done efficiently and straightforwardly. Stat is designed to meet these requirements.
Motivated by the needs of users who want to simplify their work and experiments related to learning
from textual data, we initiated the Stat project, dedicated to providing them with suitable toolkits to
facilitate their analytics tasks on textual data.
In a nutshell, Stat is an open source framework aimed at providing researchers and
engineers with an integrated set of simplified, reusable, and convenient toolkits for textual
data analysis. Based on this framework, researchers can carry out their machine learning
experiments on textual data conveniently and comfortably, and engineers can build their
own small applications for text analytics straightforwardly and efficiently.
In terms of comprehensiveness of features, this framework may not be the most
suitable one compared to other existing packages. However, we are dedicated to making all the
code we write well-designed, efficient, and reliable. (Need change.)
This project involves developing a simplified and reusable framework (a collection of foundation
classes) in Java that provides basic and common capabilities for people to easily perform machine
learning analysis on various kinds of textual data.
Add what aspects specifically we are going to do here.
Below is the list of stakeholders and how this project will affect them:
• Researchers, particularly in language technology but also in other fields, would be able
to save time by focusing on their experiments instead of dealing with the various input/output
formats that text processing routinely requires. They can also easily switch between
the various tools available and even contribute to STAT so that others can save time by using
their adaptors and algorithms.
• Software engineers who are not familiar with machine learning can start using the
package in their programs after a very short learning phase. STAT can help them develop clear
concepts of machine learning quickly. They can easily build their applications using the functionality
provided by STAT and achieve a high level of performance.
• Developers of learning packages can provide plug-ins for STAT to ease integration
of their packages. They can also delegate some of their interoperability needs to this
program (some of which may be more time consuming to address within their own
• Beginners to text processing and mining, who want fundamental and easy-to-learn
capabilities for discovering patterns in text. They will benefit from this project,
which saves them time, facilitates their learning process, and sparks their interest in the
area of language technology.
This project faced many challenges from the beginning. There are many questions, some
of a subjective nature, that really need to be addressed by our target audience. For this reason, we
designed a survey to obtain a better understanding and provide a more suitable solution to this
problem. In this chapter, we explain the process of designing the survey, collecting information,
and some analysis of the collected data.
2.1 Designing the Survey
The primary goals of the survey were the following:
• Understanding the potential users of the package: their programming habits, problem-solving
strategies, experience in various areas and tools, etc.
• Setting priorities for which criteria to focus on in our design and implementation
The survey needed to be short and the questions very specific to get better responses. The
maximum number of questions was set at 10. Several drafts of the questions were reviewed
within the STAT group and by the software engineering class students and instructors
until finalized. We also obtained and incorporated some advice from other departments. The final
survey was designed on SurveyMonkey.com.
The target users of STAT are two main groups with different needs: researchers and industry
programmers. The survey contains questions to distinguish these two groups, but the final framework
should address the needs of both. After conducting a test run with the STAT group
and the class, we sent the survey out to the Language Technology Institute student mailing list
(representing researchers) and also to students in iLab (Prof. Ramayya Krishnan, Heinz School of
Business), representing industry programmers.
2.3 Analysis of Results
As of 2/25/09, we have received 23 responses, which were reviewed individually by STAT members
and also in aggregate. Below we summarize the findings of the survey, along with some charts:
• While many different programming languages are used (Python, R, C++), over 90
• Users don’t seem to distinguish much between industry and research applications; perhaps
more research is needed for the difference to become apparent.
• Most users are not familiar with Operations Research, but everyone is somewhat familiar with
Machine Learning (if not specifically text classification or data mining).
• Data types were, as expected, mostly textual (plain, XML, HTML, etc., as opposed to Excel,
though it was mentioned), and sources were files, databases, and the web.
• Over 50
• Ease of API use, Performance, and Extensibility were the top three choices in design, but in
addition to those, in their textual descriptions users mostly pointed out problems with input and
Charts to be added here...
Analysis of Related Packages
In this chapter, we analyze a few main competitors of our project. We focus on two academic
toolkits: Weka and MinorThird. We comment on their strengths and explore their limitations, and
discuss why and how we can do better than these competitors.
Weka is a comprehensive collection of machine learning algorithms for solving data mining problems,
implemented in Java and open sourced under the GPL.
3.1.1 Strengths of Weka
Weka is a very popular machine learning package, due to its main strengths:
• Provides comprehensive machine learning algorithms. Weka supports most current
machine learning approaches for classification, clustering, regression, and association rules.
• Covers most aspects of a full data mining process. In addition to learning,
Weka supports common data preprocessing methods, feature selection, and visualization.
• Freely available. Weka is open source, released under the GNU General Public License.
• Cross-platform. Weka is fully implemented in Java and therefore cross-platform.
Because of its support for a comprehensive set of machine learning algorithms, Weka is often used for
analytics on many forms of data, including textual data.
3.1.2 Limitations of using Weka for text analysis
However, Weka is not designed specifically for textual data analysis. The most critical drawback
of using Weka for processing text is that Weka does not provide “built-in” constructs for natural
representation of linguistic concepts [1]. Users interested in using Weka for text analysis often find
that they need to write ad hoc programs for text preprocessing and conversion to Weka
[1] Though there are classes in Weka supporting basic natural language processing, they are viewed as auxiliary
utilities. They make performing basic textual data processing using Weka possible, but not conveniently and
straightforwardly.
• Not good at understanding various text formats. Weka is good at understanding its
standard .arff format, which is however not a convenient way of representing text. Users
have to worry about how to convert textual data in various original formats, such as
raw plain text, XML, HTML, CSV, Excel, PDF, MS Word, Open Office documents, etc., into
a form understandable by Weka. As a result, they need to spend time seeking or writing external
tools to complete this task before performing their actual analysis.
• Unnecessary data type conversion. Weka is strong at processing nominal (a.k.a. categorical)
and numerical attributes, but not string attributes. In Weka, non-numerical attributes
are by default imported as nominal attributes, which is usually not a desirable type for text
(imagine treating different chunks of text as different values of a categorical attribute). One
has to explicitly use filters to do the conversion, which could have been done automatically if
Weka knew that text was being imported.
• Lack of specialized support for linguistic preprocessing. Linguistic preprocessing
is a very important aspect of textual data analysis but not a concern of Weka. Weka is
not dedicated to handling this issue for users. Weka has a
StringToWordVector class that performs all-in-one basic linguistic preprocessing, including
tokenization, stemming, stopword removal, tf-idf transformation, etc. However, it is not very
flexible and lacks other techniques (such as part-of-speech tagging and n-gram processing)
for users who want fine-grained and advanced linguistic control.
• Unnatural representation of textual data learning concepts. Weka is designed for
general-purpose machine learning tasks, so it has to protect against too many variations. As a result,
domain concepts in Weka are abstract and high-level, the package hierarchy is deep, and the
number of classes explodes. For example, we have to use Instance rather than Document and
Instances rather than Corpus. Concepts in Weka such as Attribute are obscure in meaning
for text processing. First adding many Attribute objects to a cryptic FastVector, which is then passed
to an Instances object in order to construct a dataset, appears very awkward to users processing
text. Categorizing filters first by attribute/instance and then by supervised/unsupervised
confuses non-expert users and makes it hard for them to find the right filters. Many users may feel
uncomfortable using Weka programmatically to carry out their text-related experiments.
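To make the contrast with Instance and Instances concrete, the following is a minimal sketch of what text-native domain types might look like. The class names Document and Corpus come from the discussion above, but every field and method shown here is a hypothetical illustration, not an existing Weka or STAT API.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical text-native domain types; illustrative only,
// not part of Weka or of any released STAT API.
final class Document {
    private final String id;
    private final String text;
    private final String label; // null when the document is unlabeled

    Document(String id, String text, String label) {
        this.id = id;
        this.text = text;
        this.label = label;
    }

    String id() { return id; }
    String text() { return text; }
    String label() { return label; }
    boolean isLabeled() { return label != null; }
}

final class Corpus {
    private final List<Document> documents = new ArrayList<Document>();

    void add(Document doc) { documents.add(doc); }

    int size() { return documents.size(); }

    // Read-only view, so callers cannot modify the corpus through it.
    List<Document> documents() { return Collections.unmodifiableList(documents); }

    // Counts documents without a label, which matters for semi-supervised learning.
    int unlabeledCount() {
        int count = 0;
        for (Document d : documents) {
            if (!d.isLabeled()) {
                count++;
            }
        }
        return count;
    }
}
```

With types like these, user code talks about documents and labels directly instead of attributes and vectors of attributes.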
In summary, users who want an enjoyable experience performing text analysis need
built-in capabilities that naturally support representing and processing text. They need specialized
and convenient tools that can help them finish most common text analysis tasks straightforwardly
and efficiently. Weka cannot provide this due to its general-purpose nature, despite its
comprehensiveness.
3.1.3 Detailed design defects of Weka from the perspective of text analysis
Figure 3.1: Partial domain model for Weka for basic text analysis
Here we first explain in detail the major features of our framework.
• Simplified. APIs are clear, consistent, and straightforward. Users with reasonable Java
programming knowledge can learn our package without much effort, understand its logical
flow quickly, get started within a small amount of time, and finish the most common
tasks with a few lines of code. Since our framework is not designed for general purposes or
to include comprehensive features, there is room for us to simplify the APIs and optimize
for the most typical and frequent operations.
• Reusable. Built-in modular support is provided for the core routines across the various phases of
text analysis, including text format transformation, linguistic processing, machine learning,
and experimental evaluation. Additional functionality can easily be built on top of the core
framework, and user-defined specifications are pluggable. Existing code can be used
across environments and interoperate with related external packages, such as Weka, MinorThird,
and OpenNLP. (I use reusable instead of extendable because it covers a higher-level
concept that we might also need and be able to follow; what's your idea?)
• Any other?
4.1 Functional Requirements
In this section, we define the most common use cases of our framework and describe them at the level
of detail of a casual use case. The “functional requirements” of this project are that users can
use the libraries provided by our framework to complete these use cases more easily and comfortably
than without them.
Since our framework assumes that all users of interest program against our APIs, there is
only one human actor role, namely the programmer. This human actor is always the primary
actor. There are some possible secondary and system actors, namely the external packages our
framework integrates, depending on which specific use case the primary actor is performing.
Fully-dressed Use Cases
Use Case UC1: Document Classification Experiment
Scope: Text analysis application using STAT framework
Level: User goal
Primary Actor: Researcher
Stakeholder and Interests:
• Researcher: Wants to test and evaluate a classification algorithm (supervised,
semi-supervised, or unsupervised) by applying it to a (probably well-known) corpus; the task
needs to be done efficiently with easy and straightforward coding
Preconditions:
• STAT framework is correctly installed and configured
• The corpus is placed on a source readable by STAT framework
Success Guarantee:
• A model is trained and test documents in the corpus are classified. Evaluation results
Main Success Scenario:
1. Researcher imports the corpus from its source into memory. Specifically, the system
reads data from the source, parses the raw format, extracts information according to
the schema, and constructs an in-memory object to store the corpus
2. Researcher performs preprocessing on the corpus. Specifically, for each document, the
researcher tokenizes the text, removes the stopwords, performs stemming on the tokens,
performs filtering, and/or applies other potential preprocessing to body text and metadata
3. Researcher converts the corpus into the feature vectors needed for machine learning.
The feature vectors are created by analyzing the documents in the corpus, deriving or
filtering features, adding or removing documents, sampling documents, handling missing
entries, normalizing features, selecting features, and/or other potential processing
4. Researcher splits the processed corpus into training and testing sets
5. Researcher chooses a machine learning algorithm, sets its parameters, and uses it to train
a model from the training set
6. Researcher classifies the documents in the test set based on the trained model
7. Researcher evaluates the classification based on the classification results obtained on the
test set and its true labels. Classification is evaluated mainly on classification accuracy
and classification time or, if it is unsupervised, on other unsupervised metrics such as
the Adjusted Rand Index.
8. Researcher displays the final evaluation result
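Steps 2 and 3 of the scenario above can be sketched in plain Java as follows. This is only an illustration under stated assumptions: the class name BagOfWordsSketch, its methods, and the tiny stopword list are all hypothetical, stemming is omitted for brevity, and raw term counts stand in for whatever feature representation the framework eventually adopts.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Illustrative sketch of steps 2 and 3; all names here are hypothetical.
final class BagOfWordsSketch {
    // A tiny stand-in stopword list; a real list would be much longer.
    static final Set<String> STOPWORDS =
            new HashSet<String>(Arrays.asList("a", "an", "the", "is", "of", "and", "to"));

    // Step 2: lowercase, split on runs of non-letter characters, drop stopwords.
    static List<String> preprocess(String text) {
        List<String> tokens = new ArrayList<String>();
        for (String raw : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!raw.isEmpty() && !STOPWORDS.contains(raw)) {
                tokens.add(raw);
            }
        }
        return tokens;
    }

    // Step 3: build the vocabulary (term -> column index) in first-seen order.
    static Map<String, Integer> buildVocabulary(List<List<String>> tokenizedDocs) {
        Map<String, Integer> vocab = new LinkedHashMap<String, Integer>();
        for (List<String> doc : tokenizedDocs) {
            for (String token : doc) {
                if (!vocab.containsKey(token)) {
                    vocab.put(token, vocab.size());
                }
            }
        }
        return vocab;
    }

    // Step 3: turn one tokenized document into a raw term-count vector.
    static int[] toCountVector(List<String> tokens, Map<String, Integer> vocab) {
        int[] counts = new int[vocab.size()];
        for (String token : tokens) {
            Integer column = vocab.get(token);
            if (column != null) { // terms outside the vocabulary are ignored
                counts[column]++;
            }
        }
        return counts;
    }
}
```

Building the vocabulary from the training documents only, and then reusing it for the test documents, keeps the feature columns of both sets aligned.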
Extensions:
1a. The framework is unable to find the specified source.
    1. Throw source not found exception
1b. Researcher loads a previously saved corpus in native format from a file on disk directly
into an in-memory object; thus the researcher does not handle source, format, or schema explicitly.
    1a. File not found:
        1. Throw file not found exception
    1b. Malformed native format:
        1. Throw malformed native format exception
4a. Researcher specifies a parameter k larger than the number of documents or smaller than 1
    1. Throw invalid argument exception
1-3, 5a. Researcher saves the in-memory objects of the different levels of processed corpus
representation to disk in native format, which can be loaded back later, after finishing each
1-3, 5b. Researcher exports the in-memory objects of the different processed corpus representations
to disk in external formats (e.g., Weka ARFF, CSV) which can be processed by external software.
6a. Researcher saves the in-memory model object to disk, from which it can be loaded back later.
6b. Researcher loads a previously saved model in native format from a file on disk directly
into an in-memory object.
    1a. File not found:
        1. Throw file not found exception
    1b. Malformed native format:
        1. Throw malformed native format exception
4-8b. To perform k-fold cross validation, the corpus is split into k parts in step 4, and steps
5-8 are repeated k times, using each part in turn as the test split and the rest as training data.
Researcher combines the evaluations on the different test sets obtained in the previous steps
into a final classification evaluation
6c. Unsupported learning parameters (the learning algorithm cannot handle the combination
of parameters the researcher specifies)
    1. Throw unsupported learning parameters exception
6d. Unsupported learning capability (the learning algorithm cannot handle the format and
data in the training set, potentially caused by unsupported feature type, class type, missing
    1. Identify exception cause(s)
    2. Throw corresponding exception(s)
8a. Incompatibility between test set and classification (potentially caused by a difference in
schema between training set and test set)
    1. Throw incompatible evaluation exception
10a. The researcher customizes the display instead of using the default display format.
    1. The researcher obtains specific fields of the evaluations via the interfaces provided
    2. The researcher constructs a customized format using the extracted fields
    3. The researcher displays it in the customized format and/or writes it to a destination
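Extensions 4a and 4-8b above (rejecting an invalid k, and k-fold cross validation) can be sketched as follows. The class KFoldSketch and its methods are hypothetical names. The sketch works on document indices only, assigns them to folds round-robin without shuffling, and uses the standard IllegalArgumentException as a stand-in for the "invalid argument exception" named in 4a.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch; KFoldSketch and its methods are hypothetical names.
final class KFoldSketch {
    // Assigns document indices 0..numDocuments-1 to k folds in round-robin order.
    // Rejects k < 1 and k > numDocuments, mirroring extension 4a.
    static List<List<Integer>> folds(int numDocuments, int k) {
        if (k < 1 || k > numDocuments) {
            throw new IllegalArgumentException(
                    "k must be between 1 and " + numDocuments + ", got " + k);
        }
        List<List<Integer>> folds = new ArrayList<List<Integer>>();
        for (int f = 0; f < k; f++) {
            folds.add(new ArrayList<Integer>());
        }
        for (int i = 0; i < numDocuments; i++) {
            folds.get(i % k).add(i);
        }
        return folds;
    }

    // Training indices for one test fold: every index that is not in that fold.
    static List<Integer> trainingIndices(List<List<Integer>> folds, int testFold) {
        List<Integer> train = new ArrayList<Integer>();
        for (int f = 0; f < folds.size(); f++) {
            if (f != testFold) {
                train.addAll(folds.get(f));
            }
        }
        return train;
    }

    // Combines per-fold accuracies into a single figure by simple averaging.
    static double meanAccuracy(double[] foldAccuracies) {
        if (foldAccuracies.length == 0) {
            throw new IllegalArgumentException("at least one fold accuracy is required");
        }
        double sum = 0.0;
        for (double a : foldAccuracies) {
            sum += a;
        }
        return sum / foldAccuracies.length;
    }
}
```

In practice the indices would usually be shuffled (or stratified by class) before being assigned to folds; that is left out here to keep the sketch deterministic.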
Special Requirements:
• Pluggable preprocessors in steps 2-3
• Pluggable learning algorithm in step 6
• Learning algorithm should be scalable to deal with large corpora
• Researcher should be able to visualize results after various steps to trace the state of
different objects (e.g., preprocessed corpus, models, classifications, evaluations)
• Researcher should be able to customize the visualization output
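One way to read the "pluggable preprocessors" requirement above is as a small interface that every preprocessing stage implements, plus a pipeline that runs the stages in order. The sketch below assumes this reading; the names Preprocessor, Lowercase, MinLength, and PreprocessingPipeline are hypothetical and only illustrate the idea.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical plug-in point: any token-level transformation implements this.
interface Preprocessor {
    List<String> apply(List<String> tokens);
}

// Example stage: lowercases every token.
final class Lowercase implements Preprocessor {
    public List<String> apply(List<String> tokens) {
        List<String> out = new ArrayList<String>();
        for (String t : tokens) {
            out.add(t.toLowerCase(Locale.ROOT));
        }
        return out;
    }
}

// Example stage: keeps only tokens with at least `min` characters.
final class MinLength implements Preprocessor {
    private final int min;

    MinLength(int min) {
        this.min = min;
    }

    public List<String> apply(List<String> tokens) {
        List<String> out = new ArrayList<String>();
        for (String t : tokens) {
            if (t.length() >= min) {
                out.add(t);
            }
        }
        return out;
    }
}

// Runs the plugged-in stages in the order they were supplied.
// The pipeline is itself a Preprocessor, so pipelines can be nested.
final class PreprocessingPipeline implements Preprocessor {
    private final List<Preprocessor> stages;

    PreprocessingPipeline(Preprocessor... stages) {
        this.stages = Arrays.asList(stages);
    }

    public List<String> apply(List<String> tokens) {
        List<String> current = tokens;
        for (Preprocessor stage : stages) {
            current = stage.apply(current);
        }
        return current;
    }
}
```

A user-defined stemmer or stopword filter would plug in the same way, by implementing the interface and being passed to the pipeline constructor. The same pattern could apply to learning algorithms.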
Open Issues:
• How to address the variation issues in reading different sources
• How (in what form) to let the researcher specify parameters for different learning algorithms
• What specifically needs to be exportable, persistable, and visualizable?
• How to implement the corpus splitting in an efficient way (don't create extra objects)
• How to deal with the performance issues of storing large corpora in memory
• How to deal with the internal representation of the dataset in an efficient data structure
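For the question above about splitting a corpus without creating extra objects, one possible direction is a read-only view that stores only the positions of the selected documents and shares the documents themselves with the full corpus. The sketch below illustrates that option; IndexView is a hypothetical name, and this is one candidate approach rather than a decided design.

```java
import java.util.AbstractList;
import java.util.List;

// Hypothetical read-only view over a subset of a backing list.
// Only the int[] of positions is allocated; the elements themselves are shared.
final class IndexView<T> extends AbstractList<T> {
    private final List<T> backing;
    private final int[] positions;

    IndexView(List<T> backing, int[] positions) {
        this.backing = backing;
        this.positions = positions.clone(); // defensive copy of the index array
    }

    @Override
    public T get(int i) {
        return backing.get(positions[i]);
    }

    @Override
    public int size() {
        return positions.length;
    }
}
```

Because AbstractList leaves the mutating methods unsupported by default, a training or test view built this way cannot accidentally modify the underlying corpus. For good performance the backing list should offer fast positional access, as ArrayList does.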
4.2 Non-functional Requirements
• Open source. It should be made available for public collaboration, allowing users to use,
change, improve, and redistribute the software.
• Portability. It should install, configure, and run consistently across different
platforms, given its design and implementation on the Java runtime environment.
• Documentation. Its code should be readable and self-explanatory, with critical or tricky parts
documented clearly and unambiguously. It should include an introductory guide for users
to get started and, preferably, provide sample datasets, tutorials, and demos so that users can run
examples out of the box.
• Performance. It should be able to respond to the user within a reasonable amount of time given
a limited amount of data (unclear, need to specify). Preferably, it can estimate the running
time needed to perform a task and notify the user before the user actually executes the task (is this
the responsibility of the framework designers?)
• Dependency. This is an open issue. The package integrates other external packages and has
many dependencies. How do we resolve this issue? How do we distribute our package?