A Management and Visualisation Tool for Text Mining Applications

Student: Peishan Mao
MSc Computing Science Project Report
School of Computing Science and Information Systems
Birkbeck College, University of London
2005

Status: Draft
Last saved: 26 Apr. 10
1 TABLE OF CONTENTS

1   TABLE OF CONTENTS
2   ACKNOWLEDGEMENT
3   ABSTRACT
4   INTRODUCTION
5   BACKGROUND
    5.1   Written Text
    5.2   Natural Language Text Classification
          5.2.1   Text Classification
          5.2.2   The Classifier
    5.3   Text Classifier Experimentations
6   HIGH-LEVEL APPLICATION DESCRIPTION
    6.1   Description and Rationale
          6.1.1   Build a Classifier
          6.1.2   Evaluate and Refine the Classifier
    6.2   Development and Technologies
7   DESIGN
    7.1   Functional Requirements
    7.2   Non-Functional Requirements
          7.2.1   Usability
          7.2.2   Hardware and Software Constraint
          7.2.3   Documentation
    7.3   System Framework
    7.4   Components in Detail
          7.4.1   The Client - User Interface
          7.4.2   Display Manager
          7.4.3   The Classifier
          7.4.4   Data Manipulation and Cleansing
          7.4.5   Experimentation
          7.4.6   Results Manager
          7.4.7   Error Handling
    7.5   Class Diagram
8   DATABASE
    8.1   Entities
          8.1.1   Score Table
          8.1.2   Source Table
          8.1.3   Configuration Table
          8.1.4   Score Functions Table
          8.1.5   Match Normalisation Functions Table
          8.1.6   Tree Normalisation Functions Table
          8.1.7   Classification Condition Table
          8.1.8   Class Weights Table
          8.1.9   Temporary Max and Min Score Table
    8.2   Views
          8.2.1   Weighted Scores
          8.2.2   Maximum and Minimum Scores
          8.2.3   Misclassified Documents
    8.3   Relation Design for the Main Tables
9   IMPLEMENTATION
    9.1   Main User Interface
    9.2   Display Manager
    9.3   Classifier Classes
    9.4   Results Output Classes
    9.5   Other Controller Classes
    9.6   TreeView Controller Class
    9.7   Error Interface
10  IMPLEMENTATION SPECIFICS
    10.1  Generic Selection Form Class
    10.2  Visualisation of the Suffix Tree
    10.3  Dynamic Sub-String Matching
    10.4  User Interaction Warnings
11  USER GUIDE
    11.1  Getting Started
          11.1.1  Input Data
    11.2  Loading a Resource Corpus
    11.3  Selecting a Sampling Set
    11.4  Performing Pre-processing
    11.5  Running N-Fold Cross-Validation
          11.5.1  Set Up Cross-Validation Set
          11.5.2  Perform Experiments on the Data
                  11.5.2.1  Create the Suffix Tree
                  11.5.2.2  Display Suffix Tree
                  11.5.2.3  Delete Suffix Tree
                  11.5.2.4  N-Gram Matching
                  11.5.2.5  Score Documents
                  11.5.2.6  Classify Documents
                  11.5.2.7  Add New Document to Classify
    11.6  Creating a Classifier
12  TESTING
13  CONCLUSION
    13.1  Evaluation
    13.2  Future Work
14  BIBLIOGRAPHY
15  APPENDIX A  DATABASE
16  APPENDIX B  CLASS DEFINITIONS
17  APPENDIX C  SOURCE CODE
2 ACKNOWLEDGEMENT

I would like to thank the following people for their help over the course of this project:

Rajesh Pampapathi: for his spectrum of help on the project, ranging from patient advice on the whole area of text classification and pointing me in the right direction for information on the topic, to being interviewed as a potential user of the proposed system as part of the requirements collection.

Timothy Yip: for laboriously proofreading the draft of this report despite not having much interest in information technology.
3 ABSTRACT

This report describes the design and implementation of a management and visualisation tool for text classification applications. The system is built as a wrapper for a machine learning classification tool. It aims to provide a flexible framework that accommodates future changes to the system. The system is implemented in C# .NET with a Windows Forms front end and, as an example, an Access database, but should be flexible enough for different underlying components to be added.
4 INTRODUCTION

This report describes the project carried out to implement a management and visualisation tool for text classification. It covers background information about the project, the design, the implementation and the conclusion. The report is organised as follows:

Section 4: this section. It describes the organisation of the report.
Section 5: the background of the project, covering natural language classification and the suffix tree data structure used in Pampapathi et al's study.
Section 6: a high-level description and rationale of the system.
Section 7: the design of the system. It lays out the system requirements and system framework, and describes the system components and classes.
Section 8: the database design and a description of the database entities and table relations.
Section 9: how the system was implemented, including class definitions.
Section 10: specific implementation details: the generic selection form class, visualisation of the suffix tree, dynamic sub-string matching on documents, and user warnings.
Section 11: the user guide to the system.
Section 13: the conclusion. It discusses whether the system built has met the requirements laid out at the beginning of the project, and looks at future work.
Appendix A Database
Appendix B Class Definitions
Appendix C Source Code
5 BACKGROUND

5.1 Written Text

Writing has long been an important means of exchanging information, ideas and concepts from one individual to another, or to a group. Indeed, it is even thought to be the single most advantageous evolutionary adaptation for species preservation [2]. The written text available contains a vast amount of information, and the advent of the internet and on-line documents has contributed to the proliferation of digital textual data readily available for our perusal. Consequently, it is increasingly important to have a systematic method of organising this corpus of information, and tools for textual data mining are proving ever more important to our growing mass of text-based data. The discipline of computing science has made significant contributions to this area by automating the data mining process.

Encoding unstructured text data into a more structured form is not a straightforward task. Natural language is rich and ambiguous, and working with free text is one of the most challenging areas in computer science. This project aims to investigate how computer science can help to evaluate some of the vast amount of textual information available to us, and how to provide a convenient way to access this type of unstructured data. In particular, the focus will be on the data classification aspect of data mining. The next section explores this topic in more depth.

5.2 Natural Language Text Classification

5.2.1 Text Classification

F. Sebastiani [3] described automated text categorisation as "the task of automatically sorting a set of documents into categories (or classes, or topics) from a predefined set. The task, that falls at the crossroads of information retrieval, machine learning, and (statistical) natural language processing, has witnessed a booming interest in the last ten years from researchers and developers alike." Classification maps data into predefined groups or classes.
Examples of classification applications include image and pattern recognition, medical diagnosis, loan approval, detecting faults in industrial applications, and classifying financial trends. Until the late 1980s, knowledge engineering was the dominant paradigm in automated text categorisation. Knowledge engineering consists of the manual definition, by domain experts, of a set of rules which form part of a classifier. Although this approach has produced results with accuracies as high as 90% [3], it is labour intensive and domain specific. It has since been superseded by a new paradigm based on machine learning, which addresses many of its limitations. Machine learning encompasses a variety of methods that represent the convergence of statistics, biological modelling, adaptive control theory, psychology, and artificial
intelligence (AI) [11]. Data classification by machine learning is a two-phase process (Figure 1). The first phase involves a general inductive process that automatically builds a model, using a classification algorithm, describing a predetermined set of non-overlapping data classes. This step is referred to as supervised learning because the classes are determined before the data is examined, and the set of data is known as the training data set. Data in text classification comes in the form of files, and each file is usually described as a document. Classification algorithms require that the classes are defined based purely on the content of the documents; they describe these classes by looking at the characteristics of the documents in the training set already known to belong to each class. The learned model constitutes the classifier and can be used to categorise future corpus samples. In the second phase, the classifier constructed in phase one is used for classification.

The machine learning approach to text classification is less labour intensive and is domain independent. Since the attribution of documents to categories is based purely on the content of the documents, effort is concentrated on constructing an automatic builder of classifiers (also known as the learner), and not the classifier itself [3]. The automatic builder is a tool that extracts the characteristics from the training set, which are represented by a classification model. This means that once a learner is built, new classifiers can be automatically constructed from sets of manually classified documents.

[Figure 1. a) Step one in text classification; b) step two in text classification]

5.2.2 The Classifier

In general a text classifier comprises a number of basic components. As noted in the previous section, a text classifier begins with an inductive stage.
A classifier requires some sort of text representation of documents. In order to build an internal model, the inductive step uses a set of examples for training the classifier. This set of examples is known as the training set, and each document in the training set is assigned to a class in C = {c1, c2, ..., cn}. All the documents used in the training phase are transformed into internal representations. Currently, a dominant learning method in text classification is based on a vector space model [5]. The Naïve Bayesian classifier is one example, and is often used as a benchmark in text
classification experiments. Bayesian classifiers are statistical classifiers: classification is based on the probability that a given document belongs to a particular class. The approach is 'naïve' because it assumes that the contributions of all attributes to a given class are independent and that each contributes equally to the classification problem. By analysing the contribution of each 'independent' attribute, a conditional probability is determined. Attributes in this approach are the words that appear in the documents of the training set. Documents are represented by a vector with dimensions equal to the number of different words within the documents of the training set; the value of each individual entry within the vector is the frequency of the corresponding word. According to this approach, training data are used to estimate the parameters of a probability distribution, and Bayes' theorem is used to estimate the probability of each class. A new document is assigned to the class that yields the highest probability. It is important to perform pre-processing to remove frequent words such as stop words before a training set is used in the inductive phase.

The Naïve Bayesian approach has several advantages. Firstly, it is easy to use; secondly, only one scan of the training data is required. It can also easily handle missing values by simply omitting the corresponding probability when calculating the likelihood of membership in each class. Although the Naïve Bayesian classifier is popular, documents are represented as a 'bag of words' in which the words of a document have no relationships with each other, even though words that appear in a document are usually not independent. Furthermore, the smallest unit of representation is a word. Research continues to investigate how the design of text classifiers can be improved, and Pampapathi et al [1] at Birkbeck College, London recently proposed an innovative new approach to the internal modelling of text classifiers.
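As an illustration, the Naïve Bayesian scheme described above can be sketched in a few lines. The system in this report is implemented in C#; the sketch below is in Python purely for compactness, and the whitespace tokeniser, Laplace smoothing constant and all function names are illustrative assumptions rather than details of any system described here.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(docs):
    """docs: list of (text, label) pairs from the pre-classified training set.
    Returns class priors, per-class word frequency counts, and the vocabulary."""
    priors = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in docs:
        words = text.lower().split()       # naive tokeniser (assumption)
        word_counts[label].update(words)
        vocab.update(words)
    return priors, word_counts, vocab

def classify(text, priors, word_counts, vocab):
    """Assign the class with the highest posterior probability, treating word
    contributions as independent; add-one (Laplace) smoothing is assumed."""
    total_docs = sum(priors.values())
    best, best_score = None, float("-inf")
    for label in priors:
        score = math.log(priors[label] / total_docs)
        denom = sum(word_counts[label].values()) + len(vocab)
        for w in text.lower().split():
            score += math.log((word_counts[label][w] + 1) / denom)
        if score > best_score:
            best, best_score = label, score
    return best
```

For instance, trained on a handful of labelled spam/ham messages, the classifier assigns an unseen document to whichever class yields the higher smoothed log-probability.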
They used a well-known data structure called a suffix tree [11], which allows the characteristics of documents to be indexed at a more granular level, with documents represented by substrings. The suffix tree is a compact trie containing all the suffixes of the strings it represents. A trie is a tree structure where each node represents one character and the root represents the null string. Each path from the root represents a string, described by the characters labelling the nodes traversed. All strings sharing a common prefix branch off from a common node. When the strings are words over a to z, a node has at most 26 children, one for each letter (or 27, with a terminator). Suffix trees have traditionally been used for complex string matching problems on string sequences (data compression, DNA sequencing). Pampapathi et al's research is the first to apply suffix trees to natural language text classification.

Pampapathi et al's method of constructing the suffix tree varies slightly from the standard one. Firstly, the tree nodes are labelled instead of the edges, in order to associate the frequencies directly with the characters and substrings. Secondly, a special terminal character is not used, as the focus is on the substrings and not the suffixes. Each suffix tree has a depth, described by the maximum number of levels in the tree; a level is defined by the number of nodes away from the root node. For example, the suffix tree illustrated in Figure 2 has a depth of 4. Pampapathi et al set a limit on the tree depth, and each node of the suffix tree stores a frequency and a character. For example, constructing a suffix tree for the string S1 = "COOL" creates the suffix tree in Figure 2. The substrings are COOL; OOL; OL; and L.
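The node-labelled construction just described can be sketched as follows. This is a Python illustration, not the Birkbeck library: the class and function names are invented for the example, and the counting rule (a node's frequency is incremented at most once per inserted string) is one reading of the counts shown in Figures 2 and 3, stated here as an assumption.

```python
class Node:
    def __init__(self, char=None):
        self.char = char        # character labelling this node (None for the root)
        self.freq = 0           # frequency stored at the node
        self.children = {}

def insert_string(root, s, max_depth=4):
    """Insert every substring of s (truncated to max_depth characters) into
    the node-labelled tree. No terminal character is used, and the tree-depth
    limit bounds how far each suffix is followed."""
    touched = set()
    for i in range(len(s)):                 # one pass per suffix of s
        node = root
        for ch in s[i:i + max_depth]:       # bounded by the depth limit
            if ch not in node.children:
                node.children[ch] = Node(ch)
            node = node.children[ch]
            if id(node) not in touched:     # count each node once per string
                node.freq += 1
                touched.add(id(node))
    return root

def count_nodes(node):
    """Total number of nodes in the tree, including the root."""
    return 1 + sum(count_nodes(c) for c in node.children.values())
```

Under these assumptions, inserting "COOL" yields the nine non-root nodes of Figure 2; adding "FOOL" creates only the four new nodes on the F-O-O-L path and raises the frequencies on the shared O and L paths to 2, matching Figure 3.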
[Figure 2. Suffix tree for the string "COOL"]

If a second string S2 = "FOOL" is inserted into the suffix tree, it will look like the diagram illustrated in Figure 3. The substrings for S2 are FOOL; OOL; OL; and L. Notice that the last three substrings of S2 duplicate substrings already seen in S1, and new nodes are not created for these repeated substrings.

[Figure 3. Suffix tree with the string "FOOL" added]

Similar to the Naïve Bayesian method, a classifier using the suffix tree for its internal model undergoes supervised learning from a training set containing documents that have been pre-classified into classes. Unlike the Naïve Bayesian approach, the suffix tree, by capturing the characteristics of documents at the character level, does not require pre-processing of the training set. A suffix tree is built for each class, and a new document is classified by scoring it against each of the trees. The class of the highest scoring tree is assigned to the document. Pampapathi et al's study was based on email
classification, and the results of the experiment showed that a classifier employing a suffix tree outperformed the Naïve Bayesian method.

In order to solve a classification problem, the classifier is one of the central components, but, as seen with the Naïve Bayesian method, it is also important to perform pre-processing on the data used for training. The next section looks at the processes involved in text classification other than the classifier component itself.

5.3 Text Classifier Experimentations

As described in the previous sections, there is a two-step process to classification:

1. Create a specific model by evaluating the training data. This step has as input the training data (including the category/class labels) and as output a definition of the model developed. The model created, which is the classifier, classifies the training data as accurately as possible.
2. Apply the model developed by classifying new sets of documents.

In the research community, or for those interested in evaluating the performance of a classifier, the second step can be more involved. First, the predictive accuracy of the classifier is estimated. A simple yet popular technique is the holdout method, which uses a test set of class-labelled samples. These samples are usually randomly selected, and it is important that they are independent of the training samples; otherwise the estimate could be optimistic, since the learned model is based on that data and therefore tends to overfit. The accuracy of a classifier on a given test set is the percentage of test set samples that are correctly classified by the classifier. For each test sample, the known class label is compared with the classifier's class prediction for that sample. If the accuracy of the classifier model is considered acceptable, the model can be used to classify new documents.

[Figure 4. Estimating classifier accuracy with the holdout method]

The estimate given by the holdout method is pessimistic, since only a portion of the initial data is used to derive the classifier. Another technique, called N-fold cross-validation, is often used in research. Cross-validation is a statistical technique which can mitigate the bias caused by a particular partition into training and test sets. It is also useful when the amount of data is limited. The method can be used to evaluate and estimate the performance of a classifier, and the aim is to obtain as honest an estimate as possible of the classification accuracy of the system. N-fold cross-validation involves
partitioning the dataset (initial corpus) randomly into N equally sized, non-overlapping blocks/folds. The training-testing process is then run N times, each time with a different test set. For example, when N = 3, we have the following training and test sets:

Run | Training blocks | Test block
 1  | 1, 2            | 3
 2  | 1, 3            | 2
 3  | 2, 3            | 1

Figure 5. 3-fold cross-validation

For each cross-validation run the user is able to use a training set to build the classifier. Stratified N-fold cross-validation is a recommended method for estimating classifier accuracy, due to its low bias and variance [13]. In stratified cross-validation, the folds are stratified so that the class distribution of the samples in each fold is approximately the same as that of the initial training set.

Preparing the training set data for classification using pre-processing can help improve the accuracy, efficiency, and scalability of the evaluation of the classification. Methods include stop word removal, punctuation removal, and stemming. Using these techniques to prepare the data and estimate classifier accuracy increases the overall computational time, yet is useful for evaluating a classifier and for selecting among several classifiers.

The current project aims to build a system which is a wrapper for a text classifier, incorporating as an example the suffix tree used in the research done by Pampapathi et al. The next section and beyond describe the project in detail.
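The holdout and stratified N-fold procedures above are straightforward to express in code. Again the sketch is in Python rather than the system's C#, and all function names, the round-robin fold assignment and the default test fraction are illustrative assumptions.

```python
import random
from collections import defaultdict

def holdout_split(samples, test_fraction=0.3, seed=42):
    """Holdout method: randomly reserve a fraction of the class-labelled
    samples as an independent test set; the remainder is the training set."""
    samples = list(samples)
    random.Random(seed).shuffle(samples)
    n_test = int(len(samples) * test_fraction)
    return samples[n_test:], samples[:n_test]   # (training set, test set)

def stratified_folds(samples, n):
    """Deal (doc, label) samples round-robin, class by class, into n
    non-overlapping folds, so each fold's class distribution approximates
    that of the full set."""
    by_class = defaultdict(list)
    for doc, label in samples:
        by_class[label].append((doc, label))
    folds = [[] for _ in range(n)]
    for items in by_class.values():
        for i, item in enumerate(items):
            folds[i % n].append(item)
    return folds

def cross_validation_runs(folds):
    """Yield N (training set, test set) pairs; each fold serves as the
    test set exactly once, as in the 3-fold pattern of Figure 5."""
    for i, test_set in enumerate(folds):
        train_set = [item for j, fold in enumerate(folds) if j != i
                     for item in fold]
        yield train_set, test_set
```

With n = 3 this reproduces the pattern of Figure 5: run 1 trains on folds 1 and 2 and tests on fold 3, and so on.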
6 HIGH-LEVEL APPLICATION DESCRIPTION

6.1 Description and Rationale

The aim of this project is to build a management and visualisation tool that will allow researchers to perform data manipulation in support of underlying text classification algorithms. The tool will provide a software infrastructure for a data mining system based on machine learning. The goal is to build a flexible framework that allows changes to the underlying components with relative ease. Functions may be added to the system in the future, and adding new functionality should have minimal effect on the current system.

The system will be built as a wrapper for the two-step process involved in classification. First, a component will be built that automatically constructs a classifier given some training data. Secondly, the system will provide capabilities to perform classification and to evaluate the performance of a classifier. Additionally, the tool will provide functionality to run data sampling and various pre-processing steps on the data.

For the researcher it is important to clearly define the training set (known as the 'resource corpus' in this report) used for training the classifier. When the resource corpus is small, the user can choose to use the entire corpus in the study. If the resource corpus is large, the tool gives the option to select sampling sets to represent it. A number of sampling methodologies are implemented that allow the user to select a sample which reflects the characteristics of the resource corpus from which it is drawn. Note that a resource corpus is grouped into classes, and this structure needed to be taken into consideration when the sampling mechanism was developed. Three popular sampling methods will be developed, although other sampling methods could be added, such as convenience sampling, judgement sampling, quota sampling, and snowball sampling.
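The three sampling methods are not named at this point in the report, but the requirement that a sample reflect the class structure of the resource corpus suggests something like stratified random sampling. The sketch below is one plausible example of such a method, in Python for illustration; the function name, signature and rounding rule are assumptions, not part of the system.

```python
import random

def stratified_sample(corpus, fraction, seed=0):
    """corpus: mapping of class name -> list of document identifiers.
    Draw the same fraction from every class, so the sampling set mirrors
    the class structure of the resource corpus it is drawn from."""
    rng = random.Random(seed)
    sample = {}
    for cls, docs in corpus.items():
        k = max(1, round(len(docs) * fraction))   # at least one doc per class
        sample[cls] = rng.sample(docs, k)
    return sample
```

For example, sampling 20% of a corpus with 100 spam and 50 ham documents yields 20 spam and 10 ham documents, preserving the 2:1 class ratio.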
Note that the user can choose to evaluate the data used to construct the classifier before actually building the classifier. The tool is designed to be generic enough to analyse a corpus of any categorisation type, e.g. automated indexing of scientific articles, email routing, spam filtering, criminal profiling, and expertise profiling.

6.1.1 Build a Classifier

The tool allows the user to build a classifier. The current framework only implements the suffix tree-based classifier developed by Birkbeck College, but is flexible enough to incorporate other classification models in the future. The research on suffix trees applied to classification is new, and there is currently no such application. The learning process of the classifier follows the machine learning approach to automated text classification, whereby the system automatically builds a classifier for the categories of interest. From the graphical user interface (GUI), the user can select a corpus to use as training data. The application links to .dll files, developed by Birkbeck College, which allow the user to build a suffix tree from the selected corpus. The internal data representation is constructed by generalising from a training set of pre-classified documents. Once the classifier is built, the user can load new documents into the system to be classified.
6.1.2 Evaluate and Refine the Classifier

In research, once a classifier has been built it is desirable to evaluate its effectiveness. Even before the construction of the classifier, the tool provides a platform for users to perform a number of experiments and refinements on the source (training) data. Hence, the second focus of the project is to provide a user-friendly front end and a base application for testing classification algorithms. The user can load in a text-based corpus and perform standard pre-processing functions to remove noise and prepare the data for experimentation. There is also a choice of sampling methods to use in order to reduce the size of the initial corpus, making it more manageable.

Sebastiani [2] notes that any classifier is prone to classification error, whether the classifier is human or machine. This is due to a notion central to text classification: the membership of a document in a class, based on the characteristics of the document and the class, is inherently subjective, since the characteristics of both the document and the class cannot be formally specified. As a result, automatic text classifiers are evaluated using a set of pre-classified documents, comparing the classifier's decisions against the original categories the documents were assigned to. For experimentation and evaluation purposes, this set of pre-classified documents is split into two sets, a training set and a test set, not necessarily of equal sizes. The tool implements an extra level of experimentation using n-fold cross-validation. When employing cross-validation in classification, it must be taken into account that the data is grouped by classes; therefore this project implements stratified cross-validation. Once a classifier has been constructed, it is possible to perform data classification experiments as well as other tasks such as single document analysis.
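The standard pre-processing functions mentioned above (stop word removal, punctuation removal and stemming, per Section 5.3) can be sketched as follows. This Python illustration is not the system's implementation: the stop-word list is a deliberately tiny subset, and the suffix-stripping loop is a naive stand-in for a real stemmer such as Porter's.

```python
import string

# Illustrative stop-word subset; a real list would be much longer.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to",
              "in", "is", "it", "are", "on"}

def preprocess(text, stop_words=STOP_WORDS):
    """Remove punctuation, lower-case the text, drop stop words, then apply
    a naive suffix-stripping stand-in for stemming."""
    text = text.translate(str.maketrans("", "", string.punctuation)).lower()
    tokens = [t for t in text.split() if t not in stop_words]
    stemmed = []
    for t in tokens:
        for suffix in ("ing", "ed", "es", "s"):
            # only strip when a reasonable stem remains
            if t.endswith(suffix) and len(t) > len(suffix) + 2:
                t = t[:-len(suffix)]
                break
        stemmed.append(t)
    return stemmed
```

For example, "The classifiers are training on the documents!" reduces to the tokens classifier, train, document.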
For example, for the implementation of a suffix tree-based classifier, the user is able to view the structure of the suffix tree and the documents in the test sets, or to load a new document and obtain a full matrix of output data about it. The output data is persisted in an information system which is subsequently used to perform analysis and visualisation tasks.

6.2 Development and Technologies

Development was done in C#, using the .NET framework. The architecture of the system was designed as an extensible platform to enable users and developers to leverage the existing framework for future system upgrades. The tool is built from several components and aims to be modular. A number of controller components provide functionality for the tool. A set of libraries is used to provide the functionality for the suffix tree; these libraries were provided by Birkbeck College, whose researchers worked closely with this project on the interface. The suffix tree data structure is built in memory and can become very large. One solution to better utilise resources is to have the data structure physically stored as one tree, although it is logically represented as individual trees for each class. Further discussion can be found in subsequent sections.
A Windows application was built as the client. This forms the interface that the user interacts with to gain access to the functionality of the tool. The output data is cached in a database. The main targeted users for the tool are researchers in the natural language text classification research community, and other users who want to mine textual data.
7 DESIGN

7.1 Functional Requirements

Requirements for the application were collected from research on natural language text classification and from discussions with targeted users in the research community. Requirements are the capabilities and conditions to which the application must conform. The functional requirements of the system are captured using "use cases". Use cases are a useful tool for describing how a user interacts with a system: they are written stories, easy to understand, that describe the interaction between the system and the user. Requirements can often change over the course of development, and for this reason there was no attempt to define and freeze all requirements at the onset of the project. The following use cases were produced; note that some were added throughout the development of the system.

Use Case Name: Load Directory as Source Corpus
Primary Actor: User
Pre-conditions: The application is running
Post-conditions: A source corpus is loaded into the application
Main Success Scenario:
Actor Action (or Intention): 1. The user selects a valid directory and has at least read access to the path.
System Responsibility: 2. Checks directory validity and access, and loads the directory into the system as a corpus. 3. Builds a tree structure of classes based on the sub-folders in the directory and displays the classes in the GUI.

Use Case Name: View a Document in Corpus
Primary Actor: User
Pre-conditions: A corpus is successfully loaded
Post-conditions: None
Main Success Scenario:
Actor Action (or Intention): 1. Select the document to view.
System Responsibility: 2. Display the content of the document in the GUI.

Use Case Name: Create Sampling Set
Primary Actor: User
Pre-conditions: A source corpus is successfully loaded
Post-conditions: A sampling set based on the source corpus is created; a new file directory is created for the corpus
Main Success Scenario:
Actor Action (or Intention): 1. User selects how the sampling set should be selected. 2. User specifies the location to store the documents/files created for the sampling set.
System Responsibility: 3. Creates a sampling set based on parameters given by the user. 4. Creates the directory structure and documents/files in the location specified by the user. 5. Displays the new corpus created in the GUI.

Use Case Name: Run Pre-Processing
Primary Actor: User
Pre-conditions: A training set exists in the system
Post-conditions: A new pre-processed sampling set is created; a new file directory is created for the corpus
Main Success Scenario:
Actor Action (or Intention): 1. Select the type of pre-processing to perform. 2. User specifies the location to store the documents/files created for the pre-processing set. 3. Run pre-processing.
System Responsibility: 4. Performs pre-processing. 5. Creates a new pre-processed set. 6. Stores the directory structure and documents/files at the location specified by the user. 7. Displays the corpus as a directory structure in the GUI.

Use Case Name: Run N-Fold Cross-Validation
Primary Actor: User
Pre-conditions: A sampling set is successfully created
Post-conditions: An n-fold cross-validation set is created virtually
Main Success Scenario:
Actor Action (or Intention): 1. User selects the sampling set to process and the number of folds.
System Responsibility: 2. Builds an n-fold cross-validation set based on the parameters given by the user, which includes the n runs, each run containing a training set and a test set. 3. Displays the new cross-validation set created in the GUI.

Use Case Name: Create Classifier (Suffix Tree)
Primary Actor: User
Pre-conditions: A cross-validation set or classification set exists
Post-conditions: Classifier created in memory
Main Success Scenario:
Actor Action (or Intention): 1. User activates an event to build a classifier for a cross-validation set or classification set. 2. User chooses any additional conditions to apply.
System Responsibility: 3. Builds the classifier in memory, based on the corpus set selected. 4. Indicates in the GUI that the classifier for the corpus has been created.

Use Case Name: Score Documents
Primary Actor: User
Pre-conditions: An n-fold cross-validation set is created; the classifier for the corpus set is created
Post-conditions: Documents in the cross-validation set are scored and the data is stored in the database
Main Success Scenario:
Actor Action (or Intention): 1. User selects the cross-validation run to score.
System Responsibility: 2. Scores all documents under the selected corpus set. 3. Inserts the score data into the database.

Use Case Name: Classify Documents
Primary Actor: User
Pre-conditions: An n-fold cross-validation set is created; the classifier for the set is created and the documents have been scored
Post-conditions: Misclassified documents in the cross-validation set are flagged
Main Success Scenario:
Actor Action (or Intention): 1. User selects the cross-validation run to classify.
System Responsibility: 2. Classifies all documents under the selected cross-validation set. 3. Flags all misclassified documents in the GUI.

Use Case Name: Create Classification Set
Primary Actor: User
Pre-conditions: A source corpus is successfully loaded
Post-conditions: A classification set is created virtually
Main Success Scenario:
Actor Action (or Intention): 1. User selects the corpus set they want to use to create a classifier.
System Responsibility: 2. Displays the new corpus created in the GUI as a classification corpus set.

Use Case Name: Load New Document to Classify
Primary Actor: User
Pre-conditions: A cross-validation set or classification set exists
Post-conditions: Substring matches and related output data are stored in the database
Main Success Scenario:
Actor Action (or Intention): 1. User decides which suffix tree to use for classification and loads in a valid textual document to be classified and analysed. 3. Score and classify the document.
System Responsibility: 2. The document name and relevant information are displayed in the GUI as an item ready to be analysed. 4. Stores the output data in the database.

Use Case Name: View a Document
Primary Actor: User
Pre-conditions: Document loaded into the system
Post-conditions: None
Main Success Scenario:
Actor Action (or Intention): 1. Select the document to view.
System Responsibility: 2. Display the content of the document in the GUI.

Use Case Name: View n-Gram Matches in Document
Primary Actor: User
Pre-conditions: The document in question is successfully loaded and the suffix tree classifier created
Post-conditions: None
Main Success Scenario:
Actor Action (or Intention): 1. User selects a string/substring in a document to match.
System Responsibility: 2. Queries the classifier to retrieve the n-length substring matches. 3. Displays to the user the frequency for the string/substring selected.

Use Case Name: View Statistics on Matches
Primary Actor: User
Pre-conditions: Document successfully loaded and scored, and output exists in the database
Post-conditions: Information displayed in the GUI
Main Success Scenario:
Actor Action (or Intention): 1. User selects to view output.
System Responsibility: 2. Queries and retrieves the relevant data from the database. 3. Displays the output in table form in the GUI.

Use Case Name: Visualise Representation of Classifier (View Suffix Tree)
Primary Actor: User
Pre-conditions: Classifier was successfully built
Post-conditions: Visual representation of the classifier displayed in the GUI
Main Success Scenario:
Actor Action (or Intention): 1. User selects the option to display the suffix tree.
System Responsibility: 2. Builds a visual representation of the classifier and displays it in the GUI.
Use Case Name: Delete Classifier
Primary Actor: User
Pre-conditions: Classifier was successfully built
Post-conditions: Classifier is deleted
Main Success Scenario:
Actor Action (or Intention): 1. User selects the classifier to delete.
System Responsibility: 2. Removes the classifier and clears the displayed tree in the GUI.

7.2 Non-Functional Requirements

The non-functional requirements for the use cases are as follows.

7.2.1 Usability

The user should have a single main user interface to interact with the system. The user interface should be user friendly, and the complexity of computation (e.g. building an n-fold cross-validation set, or scoring documents against a classification model) should be hidden from the user.

An experimental run of the suffix tree classifier could involve as many as 126 scoring configurations, which could together take considerable time to calculate. It therefore makes sense to keep a store of all calculated scores, rather than calculate them on-the-fly whenever they are requested. The results are cached in a data store, implemented in this project as a database, thereby optimising system responsiveness.

Some system requests can only be activated once a pre-condition has been satisfied, e.g. the user can only score documents once the suffix tree has been created. The system should give informative warning messages if the user attempts to perform a task whose pre-conditions are not satisfied. Where appropriate, the system may automatically carry out the pre-conditions before performing the requested task.

7.2.2 Hardware and Software Constraints

The application should be easily extensible and scalable. Developers should be able to add extra functionality and expand the workload the application can handle with relative ease. The design should allow for future enhancement of the system and should be reasonably easy to maintain and upgrade. Code should also be well documented.
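The caching requirement above can be sketched as a score store keyed by document and scoring configuration. This is an illustrative Python sketch, not the tool's code; the tool persists scores to a database rather than the in-memory dictionary assumed here.

```python
class ScoreCache:
    """Caches classifier scores so each (document, configuration)
    pair is computed at most once rather than on-the-fly each time."""
    def __init__(self, score_fn):
        self._score_fn = score_fn   # the expensive scoring function
        self._cache = {}            # (doc_id, config_id) -> score

    def score(self, doc_id, config_id):
        key = (doc_id, config_id)
        if key not in self._cache:
            self._cache[key] = self._score_fn(doc_id, config_id)
        return self._cache[key]

calls = []
def expensive_score(doc_id, config_id):
    calls.append((doc_id, config_id))   # track how often we really compute
    return len(doc_id) * config_id      # stand-in for the real scoring function

cache = ScoreCache(expensive_score)
a = cache.score("doc1", 3)   # computed once
b = cache.score("doc1", 3)   # served from the cache
```

Repeated requests for the same document/configuration pair hit the cache, which is the behaviour the database store provides across sessions.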
The system should use an RDBMS to manage its data layer, but should remain independent of the particular RDBMS used.
7.2.3 Documentation

Help menus and tool tips will be available to help users interact with the system. The application will also come with a user manual, including screen shots, and with written documentation for its installation and configuration.

7.3 System Framework

It was decided to build the system from a number of components, each with a specialised function. Figure 6 illustrates the main components and the system boundary. The next section describes the functions of each component in more detail, and section 7.5 contains the class diagram. By isolating system responsibilities the following main components were identified:

User Interface
Display Manager
Classifier (Central Manager, STClassifier Manager, STClassifier)
Sampling Set Generator
Pre-processor
Cross-validation
Results Manager (Database Manager, OLEDB, Database)

Figure 7 shows how the system is divided into a client/server architecture. The advantage of this set-up is ease of maintenance, as the server implementation can be an abstraction to the client. All the functionalities of the system are accessed through the graphical user interface (GUI). The implementation resides in the server, isolating users from system complexities not relevant to them. One of the main aims of the design was to create a flexible framework. The green boxes seen in Figure 8 represent new or alternative components that can be added to the system in the future with relative ease.
(Diagram: the GUI, DisplayManager, Central Manager, Sampling Set Generator, Pre-processor, Cross-Validation, STClassifier Manager, STClassifier, Results Manager, Database Manager, OLEDB, Database, Stemmer and Utility components within the system boundary, with the input data outside it.)
Figure 6. System Components and Boundary

(Diagram: the same components divided into the client (the Graphical User Interface) and the server (all remaining components).)
Figure 7. Client Server Division
(Diagram: the components of Figures 6 and 7 with additional "Others..." boxes marking the points where new or alternative components can be plugged in.)
Figure 8. Additional or Alternative Components

7.4 Components in Detail

7.4.1 The Client - User Interface

The user interacts with the system via a single graphical user interface, which is also the client. In this project the client is implemented as a set of Windows Forms and controls in .NET. There is one main form from which users can access all the functionalities of the system, and a number of other dialog boxes and forms to help with navigation and interaction. For example, the Select Scoring Method form requests from the user the scoring methodology to use when scoring a new document. Other more generic forms, such as the Select Dialog form, are employed for a number of uses and do not display specific types of information (see section 10 Implementation Specifics for further discussion).

The client is simply an event handler for each of the GUI controls that calls the Central Manager via the Display Manager for actual data processing. The GUI contains no implementation, but delegates to the Display Manager, thus decoupling the interface from the implementation. There is two-way communication between the client and the Display Manager: the user invokes an event and related messages are passed to the Display Manager, which forwards them to the Central Manager. The Central Manager subsequently either delegates the task to other, more specialised controllers, or resolves the request itself. The screens were designed in consultation with potential users.
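The layered delegation just described can be sketched as follows. This is an illustrative Python sketch; the real classes are C# Windows Forms and controllers, and the method names here are simplified stand-ins.

```python
class CentralManager:
    def score_all_documents(self, corpus):
        # The real work happens on the server side of the system.
        return f"scored:{corpus}"

class DisplayManager:
    """Layer between the GUI and the Central Manager; relays
    requests down and results back up for display."""
    def __init__(self, central):
        self._central = central

    def score_all_documents(self, corpus):
        return self._central.score_all_documents(corpus)

class MainForm:
    """The GUI holds no implementation: each event handler
    simply delegates to the Display Manager."""
    def __init__(self, display):
        self._display = display

    def on_score_all_clicked(self, corpus):
        return self._display.score_all_documents(corpus)
```

Because the form never touches the Central Manager directly, the client could be replaced (e.g. by a command-line front-end) without changing the server-side components.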
The user should be able to perform all the tasks described by the use cases seen earlier in the Functional Requirements section (the functions will not be reiterated here).
For this project Windows Forms were chosen for the implementation because most users are familiar with the Windows Forms interface: it creates a familiar environment on initial interaction with the system and facilitates its use. In particular, the .NET framework provides a wealth of controls and functionality that help to build a user-friendly interface and hide the complexity of the underlying workings from the user. The different components are built as separate classes, so the user interface (the client) could be implemented using a methodology other than Windows Forms, such as a command line, as illustrated in Figure 9.

(Diagram: the GUI with its Select Scoring Method and Select Dialog forms, and an alternative command-line client, both communicating with the Display Manager.)
Figure 9. Client Interface and Its Collaborating Components

7.4.2 Display Manager

The Display Manager is a layer between the User Interface on one side and the Central Manager and the rest of the system on the other. It essentially passes messages between these two components. The Display Manager is responsible for the information displayed back to the user, and it also manages the input data.

7.4.3 The Classifier

It was mentioned in the previous section that the Central Manager is part of the classifier. Figure 10 illustrates the classifier, which is enclosed by the red box, and its connecting components. The classifier comprises the Central Manager, a controller
that manages the underlying model of the classifier, and the underlying model itself. The Central Manager is the controller that handles communication between all the main components in the system that interact with the classifier. The Central Manager should provide the following functionalities:

Select a sampling set for a corpus
Pre-process all documents in a corpus
Run cross-validation on a corpus
Create a classifier for a given corpus
Score all documents in a corpus
Classify all documents in a corpus
Obtain classification results for a corpus

There are further controller classes called by the Central Manager to provide more specialised functionalities: the Output Manager, Suffix Tree Manager, Sampling Set Generator, Pre-processor, and Cross-validation. When a user loads a corpus into the system it is managed by the Central Manager. If there is a request to create a sampling set, for example, the Central Manager knows where the corpus is located and delegates to the Sampling Set Generator the task of creating a sampling set based on parameters set by the user. Similarly, a request from the user to perform pre-processing on the corpus is delegated by the Central Manager to the Pre-processor. The various components are designed to have specialised tasks; they do not need to know where the data is located, as this information is passed to them when the Central Manager invokes a request. The Sampling Set Generator does not need to know how the Pre-processor carries out its task, nor does it need to know about the Cross-validation component. The three components receive data and requests from the Central Manager, perform their tasks and return any information back to the Central Manager. The classifier has to be connected to an internal model. In this project the suffix tree data structure is employed to model the representation of document characteristics.
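The delegation pattern described above can be sketched as follows. This is an illustrative Python sketch of the design, not the tool's C# code; the method signatures and return values are assumptions made for the example.

```python
class SamplingSetGenerator:
    def run(self, corpus_path, dest_path, method):
        # Receives everything it needs as parameters; it holds no
        # knowledge of corpus locations or of the other components.
        return f"sampled({method}, {corpus_path}) -> {dest_path}"

class Preprocessor:
    def run(self, corpus_path, dest_path, method):
        return f"preprocessed({method}, {corpus_path}) -> {dest_path}"

class CentralManager:
    """Knows where each corpus is located and delegates the work
    to the specialised controllers."""
    def __init__(self):
        self.corpora = {}            # corpus name -> path
        self.sampler = SamplingSetGenerator()
        self.preprocessor = Preprocessor()

    def load_corpus(self, name, path):
        self.corpora[name] = path

    def create_sampling_set(self, name, dest, method):
        return self.sampler.run(self.corpora[name], dest, method)

    def preprocess(self, name, dest, method):
        return self.preprocessor.run(self.corpora[name], dest, method)
```

Only the Central Manager resolves corpus locations; the specialised controllers stay independent of each other, so new ones can be added without touching existing code.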
As seen in Figure 10, the classifier could be implemented with different types of model, such as a Naïve Bayesian classifier or a Neural Network. There is two-way communication between the Central Manager and the STClassifier via the STClassifier Manager. The STClassifier is a DLL library built by Birkbeck researchers. It provides public interfaces to:

Build the representation of documents using the suffix tree data structure
Train the classifier
Score a document
Return classification results

The STClassifier Manager controls the flow of messages between the Central Manager and the STClassifier. Its responsibilities include converting data to the format accepted by the STClassifier, and converting output from the STClassifier, which is
passed back to the Central Manager. The STClassifier Manager is essentially a wrapper class for the STClassifier. The suffix tree is built using the contents of the documents in a training set. Once built, a suffix tree is cached in an ArrayList (a C# collection class implemented in .NET) managed by the STClassifier Manager. The suffix tree remains stored in memory until the user activates an event to delete it, so the system does not need to recreate the suffix tree for every subsequent action that references it; only methods in the STClassifier Manager are called, and it is not necessary to call methods in the STClassifier directly.

The classifier generates output data when a request is invoked to classify and score documents. These two actions can be time-consuming activities. The Central Manager decides what type of output data needs to be saved and passes the data from the classifier to the Results Manager to handle. Figure 13 illustrates the design of the Results Manager.

(Diagram: the classifier, comprising the Central Manager, the STClassifier Manager and the STClassifier (with alternative NBClassifier and NNClassifier managers and models), and its collaborating components: the Display Manager, Results Manager, Sampling Set Generator, Pre-processor and Cross-Validation.)
Figure 10. The Classifier and Its Collaborating Components

7.4.4 Data Manipulation and Cleansing
When a corpus is loaded into the system as input data, the user can create sampling sets from the initial corpus and also prepare the data for experimentation by performing various types of pre-processing on it. The input data is given to the classifier, which sends it to the Sampling Set Generator to handle the generation of sampling sets. Various sampling methodologies can be plugged into the Sampling Set Generator; for this project the system implements random sampling and systematic sampling. The Pre-processor provides the functionality for pre-processing data passed to it. Similarly, various methods of pre-processing can be plugged into the system with relative ease. Currently, the system provides stemming, stop word removal, and punctuation removal. In order for a method to plug into the system, the method class must implement an IMethod interface, which guarantees the following:

A method class must have a Name property to return the name of the method. This is necessary so that new methods added to the system can be identified by name.
A method class must have a Run method, where all the work is done.

A set of utility classes provides helper functionality such as a random number generator, common divisor, and file system access.

(Diagram: the Central Manager collaborating with the Sampling Set Generator (Census, Random, Systematic), the Pre-processor (Snowball stemmer, Stop Word Removal, Punctuation Removal, Others...) and the Utility classes.)
Figure 11. Data Manipulation and Cleansing Components and Their Collaborating Components

7.4.5 Experimentation

Setting up data for experimentation is the main responsibility of the Cross-validation class. The Central Manager passes a corpus to the Cross-validation component, which uses the data to build N-fold cross-validation sets. It divides the given corpus into N blocks and builds a training set and test set for each of the N runs. The data is stored as an array that is passed back to the Central Manager.
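The plug-in contract above can be sketched as follows. This is illustrative Python rather than the tool's C#; the `name`/`run` members mirror the Name property and Run method the IMethod interface requires, and the two example methods are simplified stand-ins for the real implementations.

```python
from abc import ABC, abstractmethod
import string

class IMethod(ABC):
    """Contract every pluggable pre-processing method must satisfy."""
    @property
    @abstractmethod
    def name(self):
        """Identifies the method so new ones can be found by name."""

    @abstractmethod
    def run(self, content):
        """Transform the document content and return the result."""

class PunctuationRemoval(IMethod):
    @property
    def name(self):
        return "Punctuation Removal"

    def run(self, content):
        # Strip every punctuation character from the content
        return content.translate(str.maketrans("", "", string.punctuation))

class StopWordRemoval(IMethod):
    def __init__(self, stop_words):
        self._stop = set(stop_words)

    @property
    def name(self):
        return "Stop Word Removal"

    def run(self, content):
        return " ".join(w for w in content.split() if w.lower() not in self._stop)
```

Because each method exposes its own name, the system can keep a registry keyed by name and dispatch to whichever pre-processing step the user selects.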
The methods the Cross-Validation class is expected to provide are:

Set the number of folds, N
Run N-fold cross-validation on given source data
Return the cross-validation sets in an array data structure

(Diagram: the Central Manager communicating with the Cross-Validation component.)
Figure 12. Cross-validation and Its Collaborating Components

7.4.6 Results Manager

The Results Manager handles the output of the classifier and the repository for that output. The underlying RDBMS of this project is an Access database, which is used to cache the data generated by the classifier. The OLEDB component is responsible for direct communication with the database; it needs to provide the basic database functionality (read, write, delete) in a generic fashion. All communication with the OLEDB library, and the data flow within the Results Manager, occurs through the Database Manager object, which manages the OLEDB component. The green boxes illustrate that the information system does not necessarily have to be an Access database: the system is designed so that the data can be stored by different means with relative ease, e.g. XML files, SQL Server, etc.
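The backend-independent design above can be sketched as a Results Manager that talks only to an abstract store. This is an illustrative Python sketch; in the tool the roles are played by the Database Manager and OLEDB classes, and the in-memory backend here is an assumption standing in for the Access database.

```python
from abc import ABC, abstractmethod

class ResultStore(ABC):
    """Generic read/write/delete contract, so the backing store
    (Access via OLEDB, XML files, SQL Server...) is swappable."""
    @abstractmethod
    def write(self, table, row): ...
    @abstractmethod
    def read(self, table): ...
    @abstractmethod
    def delete(self, table): ...

class InMemoryStore(ResultStore):
    # Stand-in backend; a DatabaseManager/OLEDB pair would go here.
    def __init__(self):
        self._tables = {}
    def write(self, table, row):
        self._tables.setdefault(table, []).append(row)
    def read(self, table):
        return list(self._tables.get(table, []))
    def delete(self, table):
        self._tables.pop(table, None)

class ResultsManager:
    """Depends only on the ResultStore interface, not on any RDBMS."""
    def __init__(self, store):
        self._store = store
    def save_score(self, doc, config, score):
        self._store.write("Scores", {"doc": doc, "config": config, "score": score})
    def scores(self):
        return self._store.read("Scores")
```

Swapping the Access database for, say, XML files then means writing one new ResultStore implementation, with no change to the Results Manager itself.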
(Diagram: the Results Manager, comprising the Database Manager and OLEDB components over the Access database, with alternative XML File Manager and XML file storage, collaborating with the Central Manager.)
Figure 13. Results Manager and Its Collaborating Components

7.4.7 Error Handling

Adequate error handling is essential for an end-user application. Warnings and errors should be handled at the higher level of the system, namely by the Display Manager, and then displayed to the user in a reasonable fashion. Errors that occur in the other classes should be propagated to the Display Manager. All classes apart from the User Interface and the Display Manager are expected to implement an IErrorRecord interface. A class that implements this interface guarantees that it has a property that returns the error message.
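The error-propagation contract can be sketched as follows. This is illustrative Python; the C# IErrorRecord interface exposes the error text as a property, mirrored here as `error_message`, and the example component and its method names are assumptions for the sketch.

```python
class IErrorRecord:
    """Mixin mirroring the IErrorRecord contract: components record
    errors locally and expose them for the Display Manager to read."""
    def __init__(self):
        self._error = ""

    @property
    def error_message(self):
        return self._error

class Preprocessor(IErrorRecord):
    def run(self, content, method):
        if method not in ("stem", "stopwords", "punctuation"):
            # Record the error; the Display Manager will read
            # error_message and warn the user.
            self._error = f"Unknown pre-processing method: {method}"
            return None
        self._error = ""
        return content  # real processing elided

class DisplayManager:
    def report(self, component):
        # The higher layer decides how to surface the error to the user.
        return component.error_message or "OK"
```

Lower-level components never display anything themselves; they only record the error, keeping all user-facing messaging in the Display Manager.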
7.5 Class Diagram

Figure 14 shows a class diagram of the main components of the system discussed above. The diagram covers MainForm, Controllers::DisplayManager, Controllers::TreeViewNodeManager, Controllers::SampleSetGenerator, Controllers::Preprocessor, Controllers::CrossValidation, Classifier::CentralManager, Classifier::SuffixTreeManager, Output::DatabaseManager, Output::OLEDB, DataMining::StopWord and the EMSTreeClassifier library class, together with their attributes, operations and associations.

Figure 14. Class Diagram
8 DATABASE

8.1 Entities

All the data in the system is stored in an Access database. The following describes the organisation of the data that the system stores.

8.1.1 Score Table

When a user calls to score a new document or a set of documents, each document is scored against 126 configurations for each class. The data is cached in the Scores table.

8.1.2 Source Table

The Source table stores the location properties of documents, including the physical pathname of the document and where it is logically located in the display tree.

8.1.3 Configuration Table

The Configuration table stores the 126 combinations of scoring methods used in Pampapathi et al's study. Each configuration consists of a scoring function, a match normalisation function, and a tree normalisation function.

8.1.4 Score Functions Table
This table contains the name and description of each score function.

8.1.5 Match Normalisation Functions Table

This table contains the name and description of each match normalisation function.

8.1.6 Tree Normalisation Functions Table

This table contains the name and description of each tree normalisation function.

8.1.7 Classification Condition Table

This table stores any classification conditions to be considered when classifying a document from a particular corpus.

8.1.8 Class Weights Table

This table stores the class weights applied when classifying documents.

8.1.9 Temporary Max and Min Score Table
This is a temporary table used to cache the maximum and minimum scores for a class, grouped by document and configuration.

8.2 Views

The following are some of the main views that assist in querying the main tables for the data displayed in the user interface.

8.2.1 Weighted Scores

This view obtains the weighted scores by document and scoring configuration.

8.2.2 Maximum and Minimum Scores

This view obtains the maximum and minimum scores by document and scoring configuration.

8.2.3 Misclassified Documents

This view obtains the misclassified documents and related data.

8.3 Relation Design for the Main Tables

The main table of the database is the Scores table. This table contains the scores for each document, scored by the different configuration combinations (see the Implementation
section for a description of the scoring configurations). Figure 15 shows the relationships between the main tables: the Config table references the tScoreFunction, tMatchNormalisation and tTreeNormalisation lookup tables (each holding an Index and a Name), while the Scores and tempMaxMinWScores tables each reference both the Source table (SourceId, True Class, Node Parent Path, Node Path, File Path) and the Config table.

Figure 15. Table Relations
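As a concrete illustration of this relation design, the sketch below builds simplified versions of the Config, Source and Scores tables and a maximum/minimum-score view analogous to the tempMaxMinWScores table. It uses SQLite via Python purely for illustration; the actual system uses an Access database accessed from C#/.NET, and the function names inserted into Config are hypothetical placeholders, not the actual functions from Pampapathi et al.'s study.

```python
import sqlite3
from itertools import product

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Simplified versions of the main tables from Figure 15.
cur.executescript("""
CREATE TABLE Config (ConfigId INTEGER PRIMARY KEY, SF TEXT, MN TEXT, TN TEXT);
CREATE TABLE Source (SourceId INTEGER PRIMARY KEY, TrueClass TEXT, FilePath TEXT);
CREATE TABLE Scores (ScoreId INTEGER PRIMARY KEY,
                     SourceId INTEGER REFERENCES Source(SourceId),
                     ConfigId INTEGER REFERENCES Config(ConfigId),
                     Class TEXT, Score REAL);

-- Maximum and minimum score per document and configuration,
-- analogous to the tempMaxMinWScores table (weighting omitted).
CREATE VIEW MaxMinScores AS
    SELECT SourceId, ConfigId, MAX(Score) AS MaxScore, MIN(Score) AS MinScore
    FROM Scores
    GROUP BY SourceId, ConfigId;
""")

# Each configuration is one (score function, match normalisation,
# tree normalisation) triple; placeholder names, reduced counts.
cur.executemany("INSERT INTO Config (SF, MN, TN) VALUES (?,?,?)",
                list(product(["SF1", "SF2", "SF3"],
                             ["MN1", "MN2"],
                             ["TN1", "TN2"])))

# A few example scores for one document under two configurations.
cur.executemany("INSERT INTO Scores (SourceId, ConfigId, Class, Score) VALUES (?,?,?,?)",
                [(1, 1, "spam", 0.9), (1, 1, "ham", 0.2), (1, 2, "spam", 0.4)])

print(cur.execute("SELECT COUNT(*) FROM Config").fetchone()[0])           # 12
print(cur.execute("SELECT * FROM MaxMinScores ORDER BY ConfigId").fetchall())
# [(1, 1, 0.9, 0.2), (1, 2, 0.4, 0.4)]
```

The cross product in the Config insert mirrors how the 126 real configurations arise: every combination of score function, match normalisation and tree normalisation becomes one ConfigId that the Scores rows reference.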
9 IMPLEMENTATION

Due to the large size of the program, this report will not cover all the different implementation details; instead the discussion will focus on the main classes and highlight some specific implementation points. See Appendix B, Class Definitions.

9.1 Main User Interface

The main form of the user interface is divided into four resizable panes, each of which displays a different type of information to the user (see Figure 16):

tvExplorer
rtxtView/sTreeView
lblSTreeDetail/listView
rtxtInfo

The tvExplorer is a Windows Forms TreeView control, which displays the different corpora available in the system. The information is presented as a hierarchy of nodes, much like the way files and folders are displayed in the left pane of Windows Explorer.

The rtxtView is implemented as a Windows Forms RichTextBox control. When the user selects a child node in tvExplorer that represents a document, rtxtView displays the content of that document. The rtxtView also allows users to perform dynamic n-gram (sub-string) matching on a document (see section 10.3, Dynamic Sub-String Matching).

The sTreeView is implemented as a TreeView control. It shares the same pane as the rtxtView control and is only made visible on the main form (with the rtxtView becoming invisible) when the user requests the display of a suffix tree that has been created. At the same time the lblSTreeDetail control, which is implemented as a Windows Forms Label control, displays a description of the suffix tree currently shown in the sTreeView control.

The listView is a Windows Forms ListView control which provides information related to the current content of the rtxtView control. The rtxtInfo is a RichTextBox control which displays a classification summary for a document.
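The mutually exclusive visibility of rtxtView and sTreeView described above can be modelled as a simple state toggle. This is a minimal sketch in Python of the behaviour only; the actual implementation switches the Visible property of the two Windows Forms controls in the C# MainForm class.

```python
class SharedPane:
    """Models two controls sharing one pane: exactly one is visible at a time."""

    def __init__(self):
        self.rtxt_visible = True    # document view shown by default
        self.stree_visible = False  # suffix-tree view hidden

    def show_suffix_tree(self):
        # User requests a suffix tree: hide the document view, show the tree.
        self.rtxt_visible = False
        self.stree_visible = True

    def show_document(self):
        # User selects a document node: restore the document view.
        self.rtxt_visible = True
        self.stree_visible = False


pane = SharedPane()
pane.show_suffix_tree()
print(pane.rtxt_visible, pane.stree_visible)  # False True
pane.show_document()
print(pane.rtxt_visible, pane.stree_visible)  # True False
```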
[Figure 16 is a screenshot of the main user interface, with the four panes labelled lblSTreeDetail/listView, tvExplorer, rtxtInfo and rtxtView/sTreeView.]

Figure 16. Main User Interface

The main form is implemented as a .NET class called MainForm. Figure 17 shows the class members and class interface. Note that there are other Windows Forms control classes which were implemented to control the flow of user-system interaction. Section 10, Implementation Specifics, will describe one of them in detail; see Appendix x for all the user interface classes.
