A Management and Visualisation Tool for Text Mining Applications

Student: Peishan Mao
MSc Computing Science Project Report
School of Computing Science and Information System
Birkbeck College, University of London
2005

Status: Draft
Last saved: 26 Apr. 10
1 TABLE OF CONTENTS

1 TABLE OF CONTENTS
2 ACKNOWLEDGEMENT
3 ABSTRACT
4 INTRODUCTION
5 BACKGROUND
   5.1 Written Text
   5.2 Natural Language Text Classification
      5.2.1 Text Classification
      5.2.2 The Classifier
   5.3 Text Classifier Experimentations
6 HIGH-LEVEL APPLICATION DESCRIPTION
   6.1 Description and Rationale
      6.1.1 Build a Classifier
      6.1.2 Evaluate and Refine the Classifier
   6.2 Development and Technologies
7 DESIGN
   7.1 Functional Requirements
   7.2 Non-Functional Requirements
      7.2.1 Usability
      7.2.2 Hardware and Software Constraint
      7.2.3 Documentation
   7.3 System Framework
   7.4 Components in Detail
      7.4.1 The Client - User Interface
      7.4.2 Display Manager
      7.4.3 The Classifier
      7.4.4 Data Manipulation and Cleansing
      7.4.5 Experimentation
      7.4.6 Results Manager
      7.4.7 Error Handling
   7.5 Class Diagram
8 DATABASE
   8.1 Entities
      8.1.1 Score Table
      8.1.2 Source Table
      8.1.3 Configuration Table
      8.1.4 Score Functions Table
      8.1.5 Match Normalisation Functions Table
      8.1.6 Tree Normalisation Functions Table
      8.1.7 Classification Condition Table
      8.1.8 Class Weights Table
      8.1.9 Temporary Max and Min Score Table
   8.2 Views
      8.2.1 Weighted Scores
      8.2.2 Maximum and Minimum Scores
      8.2.3 Misclassified Documents
   8.3 Relation Design for the Main Tables
9 IMPLEMENTATION
   9.1 Main User Interface
   9.2 Display Manager
   9.3 Classifier Classes
   9.4 Results Output Classes
   9.5 Other Controller Classes
   9.6 TreeView Controller Class
   9.7 Error Interface
10 IMPLEMENTATION SPECIFICS
   10.1 Generic Selection Form Class
   10.2 Visualisation of the Suffix Tree
   10.3 Dynamic Sub-String Matching
   10.4 User Interaction Warnings
11 USER GUIDE
   11.1 Getting Started
      11.1.1 Input Data
   11.2 Loading a Resource Corpus
   11.3 Selecting a Sampling Set
   11.4 Performing Pre-processing
   11.5 Running N-Fold Cross-Validation
      11.5.1 Set Up Cross-Validation Set
      11.5.2 Perform Experiments on the Data
         11.5.2.1 Create the Suffix Tree
         11.5.2.2 Display Suffix Tree
         11.5.2.3 Delete Suffix Tree
         11.5.2.4 N-Gram Matching
         11.5.2.5 Score Documents
         11.5.2.6 Classify Documents
         11.5.2.7 Add New Document to Classify
   11.6 Creating a Classifier
12 TESTING
13 CONCLUSION
   13.1 Evaluation
   13.2 Future Work
14 BIBLIOGRAPHY
15 APPENDIX A DATABASE
16 APPENDIX B CLASS DEFINITIONS
17 APPENDIX C SOURCE CODE
2 ACKNOWLEDGEMENT

I would like to thank the following people for their help over the course of this project:

Rajesh Pampapathi: for his spectrum of help on the project, ranging from his patience and advice on the whole area of text classification, and pointing me in the right direction for information on the topic, to being interviewed as a potential user of the proposed system as part of the requirements collection.

Timothy Yip: for laboriously proofreading the draft of the report despite not having much interest in information technology.
3 ABSTRACT

This report describes the design and implementation of a management and visualisation tool for text classification applications. The system is built as a wrapper for a machine learning classification tool. It aims to provide a flexible framework that can accommodate future changes to the system. The system is implemented in C# .NET with a Windows Forms front end and an Access database as an example, but should be flexible enough to allow different underlying components to be added.
4 INTRODUCTION

This report describes the project carried out to implement a management and visualisation tool for text classification. It covers background information about the project, the design, the implementation, and the conclusion. The report is organised as follows:

Section 4 is this section. It describes the organisation of the report.
Section 5 looks at the background of the project. This section covers natural language classification and the suffix tree data structure used in Pampapathi et al's study.
Section 6 gives a high-level description and rationale of the system.
Section 7 describes the design of the system. It lays out the system requirements and framework, and describes the system components and classes.
Section 8 explains the database design and describes the database entities and table relations.
Section 9 discusses how the system was implemented and goes into class definitions.
Section 10 focuses on specific implementation details: the generic selection form class, visualisation of the suffix tree, dynamic sub-string matching on documents, and user warnings.
Section 11 is the user guide to the system.
Section 12 covers testing.
Section 13 concludes the project. It discusses whether the system built has met the requirements laid out at the beginning of the project, and looks at future work.
Appendix A: Database
Appendix B: Class Definitions
Appendix C: Source Code
5 BACKGROUND

5.1 Written Text

Writing has long been an important means of exchanging information, ideas and concepts from one individual to another, or to a group. Indeed, it is even thought to be the single most advantageous evolutionary adaptation for species preservation [2]. The written text available contains a vast amount of information, and the advent of the internet and on-line documents has contributed to the proliferation of digital textual data readily available for our perusal. Consequently, it is increasingly important to have a systematic method of organising this corpus of information, and tools for textual data mining are proving increasingly important for our growing mass of text-based data. The discipline of computing science has provided significant contributions to this area by automating the data mining process.

Encoding unstructured text data into a more structured form is not a straightforward task. Natural language is rich and ambiguous, and working with free text is one of the most challenging areas in computer science. This project aims to investigate how computer science can help to evaluate some of the vast amounts of textual information available to us, and how to provide a convenient way to access this type of unstructured data. In particular, the focus will be on the data classification aspect of data mining. The next section explores this topic in more depth.

5.2 Natural Language Text Classification

5.2.1 Text Classification

Sebastiani [3] described automated text categorisation as "The task of automatically sorting a set of documents into categories (or classes, or topics) from a predefined set. The task, that falls at the crossroads of information retrieval, machine learning, and (statistical) natural language processing, has witnessed a booming interest in the last ten years from researchers and developers alike."

Classification maps data into predefined groups or classes. Examples of classification applications include image and pattern recognition, medical diagnosis, loan approval, detecting faults in industrial applications, and classifying financial trends.

Until the late 1980s, knowledge engineering was the dominant paradigm in automated text categorisation. Knowledge engineering consists of the manual definition, by domain experts, of a set of rules which form part of a classifier. Although this approach has produced results with accuracies as high as 90% [3], it is labour intensive and domain specific. A new paradigm based on machine learning, which answers many of the limitations of knowledge engineering, has since superseded it.

Machine learning encompasses a variety of methods that represent the convergence of statistics, biological modelling, adaptive control theory, psychology, and artificial intelligence (AI) [11]. Data classification by machine learning is a two-phase process (Figure 1). The first phase involves a general inductive process that automatically builds a model, using a classification algorithm, describing a predetermined set of non-overlapping data classes. This step is referred to as supervised learning because the classes are determined before examining the data; the set of data used is known as the training data set. Data in text classification comes in the form of files, and each file is often described as a document. Classification algorithms require that the classes are defined based purely on the content of the documents. They describe these classes by looking at the characteristics of the documents in the training set already known to belong to the class. The learned model constitutes the classifier and can be used to categorise future corpus samples. In the second phase, the classifier constructed in phase one is used for classification.

The machine learning approach to text classification is less labour intensive and domain independent. Since the attribution of documents to categories is based purely on the content of the documents, effort is concentrated on constructing an automatic builder of classifiers (also known as the learner), and not the classifier itself [3]. The automatic builder is a tool that extracts the characteristics of the training set, which are represented by a classification model. This means that once a learner is built, new classifiers can be automatically constructed from sets of manually classified documents.

[Figure 1. a) Step one in text classification: a classification algorithm derives a classification model from the training set. b) Step two: the classification model is applied to the test set and to new documents.]

5.2.2 The Classifier

In general a text classifier comprises a number of basic components. As noted in the previous section, the text classifier begins with an inductive stage. A classifier requires some sort of text representation of documents, and in order to build an internal model the inductive step involves a set of examples used for training the classifier. This set of examples is known as the training set, and each document in the training set is assigned to a class C = {c1, c2, ..., cn}. All the documents used in the training phase are transformed into internal representations. Currently, a dominant learning method in text classification is based on a vector space model [5].

The Naïve Bayesian classifier is one example, and it is often used as a benchmark in text classification experiments. Bayesian classifiers are statistical classifiers: classification is based on the probability that a given document belongs to a particular class. The approach is 'naïve' because it assumes that the contributions of all attributes to a given class are independent and that each contributes equally to the classification problem. By analysing the contribution of each 'independent' attribute, a conditional probability is determined. Attributes in this approach are the words that appear in the documents of the training set. Documents are represented by a vector with dimensions equal to the number of different words within the documents of the training set; the value of each individual entry within the vector is set to the frequency of the corresponding word. According to this approach, training data are used to estimate the parameters of a probability distribution, and Bayes' theorem is used to estimate the probability of a class. A new document is assigned to the class that yields the highest probability. It is important to perform pre-processing to remove frequent words, such as stop words, before a training set is used in the inductive phase.

The Naïve Bayesian approach has several advantages. Firstly, it is easy to use; secondly, only one scan of the training data is required. It can also easily handle missing values by simply omitting that probability when calculating the likelihoods of membership in each class. Although the Naïve Bayesian-based classifier is popular, documents are represented as a 'bag of words' in which words in the document have no relationships with each other. However, words that appear in a document are usually not independent. Furthermore, the smallest unit of representation is a word.

Research is continuously investigating how the design of text classifiers can be further improved, and Pampapathi et al [1] at Birkbeck College, London recently proposed an innovative new approach to the internal modelling of text classifiers. They used a well-known data structure called a suffix tree [11], which allows the characteristics of documents to be indexed at a more granular level, with documents represented by substrings. The suffix tree is a compact trie containing all the suffixes of the strings represented. A trie is a tree structure where each node represents one character and the root represents the null string. Each path from the root represents a string, described by the characters labelling the nodes traversed. All strings sharing a common prefix branch off from a common node. When the strings are words over the alphabet a to z, a node has at most 26 children, one for each letter (or 27, with a terminator). Suffix trees have traditionally been used for complex string matching problems on string sequences (data compression, DNA sequencing); Pampapathi et al's research is the first to apply suffix trees to natural language text classification.

Pampapathi et al's method of constructing the suffix tree varies slightly from the standard way. Firstly, the tree nodes are labelled instead of the edges, in order to associate the frequencies directly with the characters and substrings. Secondly, a special terminal character is not used, as the focus is on the substrings and not the suffixes. Each suffix tree has a depth, described by the maximum number of levels in the tree; a level is defined by the number of nodes away from the root node. For example, the suffix tree illustrated in Figure 2 has a depth of 4. Pampapathi et al set a limit on the tree depth, and each node of the suffix tree stores the frequency and the character.
For example, constructing a suffix tree for the string S1 = "COOL" creates the suffix tree shown in Figure 2. The substrings are COOL; OOL; OL; and L.
[Figure 2. Suffix tree for the string 'COOL': the root branches into the paths C-O-O-L, O-O-L, O-L, and L, each node with frequency 1.]

If a second string S2 = "FOOL" is inserted into the suffix tree, it will look like the diagram illustrated in Figure 3. The substrings for S2 are FOOL; OOL; OL; and L. Notice that the last three substrings in S2 duplicate substrings already seen in S1, and new nodes are not created for these repeated substrings.

[Figure 3. Suffix tree with the string 'FOOL' added: a new F-O-O-L path with frequency 1 is created; the C-O-O-L path keeps frequency 1, while the shared O-O-L, O-L, and L paths now have frequency 2.]

Similar to the Naïve Bayesian method, a classifier using the suffix tree for its internal model undergoes supervised learning from a training set containing documents that have been pre-classified into classes. Unlike the Naïve Bayesian approach, the suffix tree, by capturing the characteristics of documents at the character level, does not require pre-processing of the training set. A suffix tree is built for each class and a new document is classified by scoring it against each of the trees. The class of the highest scoring tree is assigned to the document.
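To make the construction just described concrete, below is a minimal sketch of how such a node-labelled, frequency-counting, depth-limited suffix tree might be built. It is illustrative only: the class and member names are assumptions made for this sketch, not the interface of the STClassifier library discussed later in this report.

using System.Collections;

// One node of the tree: a character label, a frequency count, and one
// child per possible following character (nodes are labelled, not edges).
public class SuffixTreeNode
{
    public char Label;
    public int Frequency;
    public Hashtable Children = new Hashtable(); // char -> SuffixTreeNode

    public SuffixTreeNode GetOrAddChild(char c)
    {
        SuffixTreeNode child = (SuffixTreeNode)Children[c];
        if (child == null)
        {
            child = new SuffixTreeNode();
            child.Label = c;
            Children[c] = child;
        }
        return child;
    }
}

public class SuffixTree
{
    private SuffixTreeNode root = new SuffixTreeNode();
    private int maxDepth;

    public SuffixTree(int maxDepth)
    {
        this.maxDepth = maxDepth;
    }

    // Insert every suffix of s, truncated to maxDepth characters, so the
    // tree indexes all substrings of s up to that length. Paths shared
    // with previously inserted strings are not duplicated; only their
    // frequency counts are incremented.
    public void Insert(string s)
    {
        for (int i = 0; i < s.Length; i++)
        {
            SuffixTreeNode node = root;
            for (int j = i; j < s.Length && j - i < maxDepth; j++)
            {
                node = node.GetOrAddChild(s[j]);
                node.Frequency++;
            }
        }
    }
}

Reproducing Figures 2 and 3 with this sketch: new SuffixTree(4) followed by Insert("COOL") gives the tree of Figure 2, and a further Insert("FOOL") adds one new F-O-O-L path while raising the frequencies on the shared O-O-L, O-L, and L paths to 2, as in Figure 3.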
Pampapathi et al's study was based on email classification, and the results of the experiment showed that a classifier employing a suffix tree outperformed the Naïve Bayesian method.

In order to solve a classification problem, not only is the classifier one of the central components; as seen with the Naïve Bayesian method, it is also important to perform pre-processing on the data used for training. The next section looks at the processes involved in text classification other than the classifier component itself.

5.3 Text Classifier Experimentations

As described in previous sections, there is a two-step process to classification:

1. Create a specific model by evaluating the training data. This step has as input the training data (including the category/class labels) and as output a definition of the model developed. The model created, which is the classifier, classifies the training data as accurately as possible.
2. Apply the model developed by classifying new sets of documents.

In the research community, or for those interested in evaluating the performance of a classifier, the second step can be more involved. First, the predictive accuracy of the classifier is estimated. A simple yet popular technique is the holdout method, which uses a test set of class-labelled samples. These samples are usually randomly selected, and it is important that they are independent of the training samples; otherwise the estimate could be optimistic, since the learned model is based on that data and therefore tends to overfit. The accuracy of a classifier on a given test set is the percentage of test set samples that are correctly classified by the classifier. For each test sample the known class label is compared with the classifier's class prediction for that sample. If the accuracy of the classifier model is considered acceptable, the model can be used to classify new documents.

[Figure 4. Estimating classifier accuracy with the holdout method: the corpus data is split into a training set, from which the classifier is derived, and a test set, used to estimate accuracy.]

The estimate obtained using the holdout method is pessimistic, since only a portion of the initial data is used to derive the classifier. Another technique, called N-fold cross-validation, is often used in research. Cross-validation is a statistical technique which can mitigate bias caused by a particular partition of training and test set. It is also useful when the amount of data is limited. The method can be used to evaluate and estimate the performance of a classifier, and the aim is to obtain as honest an estimate as possible of the classification accuracy of the system.
N-fold cross-validation involves partitioning the dataset (the initial corpus) randomly into N equally sized, non-overlapping blocks or folds. The training-testing process is then run N times, each time with a different test set. For example, when N = 3, we have the following training and test sets:

Run    Training blocks    Test block
1      1, 2               3
2      1, 3               2
3      2, 3               1

Figure 5. 3-fold cross-validation

For each cross-validation run the user will be able to use a training set to build the classifier. Stratified N-fold cross-validation is a recommended method for estimating classifier accuracy due to its low bias and variance [13]. In stratified cross-validation, the folds are stratified so that the class distribution of the samples in each fold is approximately the same as that of the initial training set.
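Below is a minimal sketch of how the fold construction just described might be implemented. The class and method names are assumptions for this sketch, not the project's actual Cross-Validation class; stratification is obtained by folding each class separately and merging the per-class folds, so every fold keeps roughly the class distribution of the initial corpus.

using System;
using System.Collections;

public class CrossValidationSketch
{
    // Randomly partition the documents of ONE class into n folds.
    public static ArrayList[] Fold(ArrayList classDocs, int n, Random rng)
    {
        // Fisher-Yates shuffle for a random, non-overlapping partition.
        for (int i = classDocs.Count - 1; i > 0; i--)
        {
            int j = rng.Next(i + 1);
            object tmp = classDocs[i];
            classDocs[i] = classDocs[j];
            classDocs[j] = tmp;
        }

        ArrayList[] folds = new ArrayList[n];
        for (int f = 0; f < n; f++)
        {
            folds[f] = new ArrayList();
        }
        for (int i = 0; i < classDocs.Count; i++)
        {
            folds[i % n].Add(classDocs[i]); // deal documents out evenly
        }
        return folds;
    }

    // Run r uses fold r as the test set and all other folds as training,
    // exactly as in the 3-fold example of Figure 5.
    public static void BuildRun(ArrayList[] folds, int run,
                                ArrayList trainingSet, ArrayList testSet)
    {
        for (int f = 0; f < folds.Length; f++)
        {
            if (f == run)
                testSet.AddRange(folds[f]);
            else
                trainingSet.AddRange(folds[f]);
        }
    }
}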
Preparing the training set data for classification using pre-processing can help improve the accuracy, efficiency, and scalability of the evaluation of the classification. Methods include stop word removal, punctuation removal, and stemming. Using these techniques to prepare the data and estimate classifier accuracy increases the overall computational time, yet is useful for evaluating a classifier and for selecting among several classifiers.

The current project aims to build a system that is a wrapper for a text classifier, and incorporates as an example the suffix tree used in the research done by Pampapathi et al. The next section and beyond describe the project in detail.

6 HIGH-LEVEL APPLICATION DESCRIPTION

6.1 Description and Rationale

The aim of this project is to build a management and visualisation tool that gives researchers data manipulation support for underlying text classification algorithms. The tool will provide a software infrastructure for a data mining system based on machine learning. The goal is to build a flexible framework that allows changes to the underlying components with relative ease. Functions may be added to the system in the future, and adding new functionality should have minimal effect on the current system.

The system will be built as a wrapper for the two-step process involved in classification. First, a component will be built that automatically constructs a classifier given some training data. Second, the tool will provide capabilities to perform classification and to evaluate the performance of a classifier. Additionally, the tool will provide functionality to run data sampling and various kinds of pre-processing on the data.

It is incumbent on the researcher to clearly define the training set (known in this report as the 'resource corpus') used for training the classifier. When the resource corpus is small the user can choose to use the entire corpus in the study. If the resource corpus is large, the tool gives the option of selecting sampling sets to represent it. A number of sampling methodologies are implemented that allow the user to select a sample reflecting the characteristics of the resource corpus from which it is drawn. Note that a resource corpus is grouped into classes, and this structure needed to be taken into consideration when the sampling mechanism was developed. Three popular sampling methods will be developed, although others can be added, such as convenience sampling, judgement sampling, quota sampling, and snowball sampling. Note that the user can choose to evaluate the data used to construct the classifier before actually building it. The tool is designed to be generic enough to analyse a corpus of any categorisation type, e.g. automated indexing of scientific articles, email routing, spam filtering, criminal profiling, and expertise profiling.

6.1.1 Build a Classifier

The tool allows the user to build a classifier. The current framework implements only the suffix tree-based classifier developed by Birkbeck College, but is flexible enough to incorporate other classification models in the future. The research on suffix trees applied to classification is new, and there is currently no such application. The learning process of the classifier follows the machine learning approach to automated text classification, whereby the system automatically builds a classifier for the categories of interest. From the graphical user interface (GUI), the user can select a corpus to use as training data. The application provides links to .dll files developed by Birkbeck College which allow the user to build a suffix tree from the selected corpus. The internal data representation is constructed by generalising from a training set of pre-classified documents. Once the classifier is built the user can load new documents into the system to be classified.
6.1.2 Evaluate and Refine the Classifier

In research, once a classifier has been built it is desirable to evaluate its effectiveness. Even before the construction of the classifier, the tool provides a platform for users to perform a number of experiments and refinements on the source (training) data. Hence, the second focus of the project is to provide a user-friendly front end and a base application for testing classification algorithms. The user can load in a text-based corpus and perform standard pre-processing functions to remove noise and prepare the data for experimentation. There is also a choice of sampling methods to use in order to reduce the size of the initial corpus, making it more manageable.

Sebastiani [2] notes that any classifier is prone to classification error, whether the classifier is human or machine. This is due to a notion central to text classification: the membership of a document in a class, based on the characteristics of the document and the class, is inherently subjective, since the characteristics of both documents and classes cannot be formally specified. As a result, automatic text classifiers are evaluated using a set of pre-classified documents, comparing the classifier's decisions against the original categories the documents were assigned to. For experimentation and evaluation purposes, this set of pre-classified documents is split into two sets: a training set and a test set, not necessarily of equal sizes. The tool implements an extra level of experimentation using N-fold cross-validation. When employing cross-validation in classification it must be taken into account that the data is grouped by class; this project therefore implements stratified cross-validation.

Once a classifier has been constructed, it is possible to perform data classification experiments as well as other tasks such as single-document analysis. For example, with the suffix tree-based classifier the user will be able to view the structure of the suffix tree, view the documents in the test sets, or load a new document and obtain a full matrix of output data about it. The output data is persisted in an information system which is subsequently used to perform analysis and visualisation tasks.

6.2 Development and Technologies

Development was done in C#, using the .NET framework. The architecture of the system was designed as an extensible platform, enabling users and developers to leverage the existing framework for future system upgrades. The tool is built from several components and aims to be modular. A number of controller components provide the functionality of the tool, and a set of libraries provides the functionality of the suffix tree. These suffix tree libraries were provided by Birkbeck College, whose researchers worked closely with the author on the interface. The suffix tree data structure is built in memory and can become very large. One solution to better utilise resources is to have the data structure physically stored as one tree, although it is logically represented as individual trees for each class. Further discussion can be found in subsequent sections.
A Windows application was built as the client. This forms the interface through which the user interacts with the system and gains access to the functionalities of the tool. The output data is cached in a database. The main target users for the tool are researchers in natural language text classification, and other users who want to mine textual data.
7 DESIGN

7.1 Functional Requirements

Requirements for the application were collected from research on natural language text classification and from discussions with targeted users in the research community. Requirements are the capabilities and conditions to which the application must conform. The functional requirements of the system are captured using 'use cases'. Use cases are a useful tool for describing how a user interacts with a system: written stories, easy to understand, that describe the interaction between the system and the user. Requirements can often change over the course of development, and for this reason there was no attempt to define and freeze all requirements at the onset of the project. The following use cases were produced. Note that some use cases were added during the development of the system.

Use Case Name: Load Directory as Source Corpus
Primary Actor: User
Preconditions: The application is running
Postconditions: A source corpus is loaded into the application
Main Success Scenarios:
   Actor Action (or Intention):
   1. The user selects a valid directory and has at least read access to the directory.
   System Responsibility:
   2. The system checks the directory path for validity and access, and loads it as a corpus into the system.
   3. Builds a tree structure of classes based on the sub-folders in the directory and displays the classes in the GUI.

Use Case Name: View a Document in Corpus
Primary Actor: User
Preconditions: A corpus is successfully loaded
Postconditions: None
Main Success Scenarios:
   Actor Action (or Intention):
   1. Select the document to view.
   System Responsibility:
   2. Display the content of the document in the GUI.

Use Case Name: Create Sampling Set
Primary Actor: User
Preconditions: A source corpus is successfully loaded
Postconditions: A sampling set based on the source corpus is created; a new file directory is created for the corpus
Main Success Scenarios:
   Actor Action (or Intention):
   1. User selects how they want to select the sampling set.
   2. User specifies a location to store the documents/files created for the sampling set.
   System Responsibility:
   3. Creates a sampling set based on the parameters given by the user.
   4. Creates the directory structure and documents/files in the location specified by the user.
   5. Displays the new corpus created in the GUI.

Use Case Name: Run Pre-Processing
Primary Actor: User
Preconditions: A training set exists in the system
Postconditions: A new pre-processed sampling set is created; a new file directory is created for the corpus
Main Success Scenarios:
   Actor Action (or Intention):
   1. Select the type of pre-processing to perform.
   2. User specifies a location to store the documents/files created for the pre-processed set.
   3. Run pre-processing.
   System Responsibility:
   4. Performs pre-processing.
   5. Creates a new pre-processed set.
   6. Stores the directory structure and documents/files at the location specified by the user.
   7. Displays the corpus as a directory structure in the GUI.

Use Case Name: Run N-Fold Cross-Validation
Primary Actor: User
Preconditions: A sampling set is successfully created
Postconditions: An N-fold cross-validation set is created virtually
Main Success Scenarios:
   Actor Action (or Intention):
   1. User selects the sampling set to process and the number of folds.
   System Responsibility:
   2. Builds the N-fold cross-validation set based on the parameters given by the user, which includes the N runs,
each run containing a training set and a test set.
   3. Displays the new cross-validation set created in the GUI.

Use Case Name: Create Classifier (Suffix Tree)
Primary Actor: User
Preconditions: A cross-validation set or classification set exists
Postconditions: Classifier created in memory
Main Success Scenarios:
   Actor Action (or Intention):
   1. User activates an event to build a classifier for a cross-validation set or classification set.
   2. User chooses any additional conditions to apply.
   System Responsibility:
   3. Builds the classifier in memory, based on the corpus set selected.
   4. Indicates in the GUI that the classifier for the corpus has been created.

Use Case Name: Score Documents
Primary Actor: User
Preconditions: An N-fold cross-validation set is created; the classifier for the corpus set is created
Postconditions: Documents in the cross-validation set are scored and the data is stored in the database
Main Success Scenarios:
   Actor Action (or Intention):
   1. User selects the cross-validation run to score.
   System Responsibility:
   2. Scores all documents under the selected corpus set.
   3. Inserts the score data into the database.

Use Case Name: Classify Documents
Primary Actor: User
Preconditions: An N-fold cross-validation set is created; the classifier for the set is created and the documents have been scored
Postconditions: Misclassified documents in the cross-validation set are flagged
Main Success Scenarios:
   Actor Action (or Intention):
   1. User selects the cross-validation run to classify.
   System Responsibility:
   2. Classifies all documents under the selected cross-validation set.
   3. Flags all misclassified documents in the GUI.

Use Case Name: Create Classification Set
Primary Actor: User
Preconditions: A source corpus is successfully loaded
Postconditions: A classification set is created virtually
Main Success Scenarios:
   Actor Action (or Intention):
   1. User selects the corpus set they want to use to create a classifier.
   System Responsibility:
   2. Displays the new corpus created in the GUI as a classification corpus set.

Use Case Name: Load New Document to Classify
Primary Actor: User
Preconditions: A cross-validation set or classification set exists
Postconditions: Substring matches and related output data are stored in the database
Main Success Scenarios:
   Actor Action (or Intention):
   1. User decides which suffix tree to use for classification and loads in a valid textual document as an item to be classified and analysed.
   3. Score and classify the document.
   System Responsibility:
   2. Document name and relevant information are displayed in the GUI, ready to be analysed.
   4. Stores the output data in the database.

Use Case Name: View a Document
Primary Actor: User
Preconditions: Document loaded into the system
Postconditions: None
Main Success Scenarios:
   Actor Action (or Intention):
   1. Select the document to view.
   System Responsibility:
   2. Display the content of the document in the GUI.
Use Case Name: View n-Gram Matches in Document
Primary Actor: User
Preconditions: The document in question is successfully loaded and the suffix classifier is created
Postconditions: None
Main Success Scenarios:
   Actor Action (or Intention):
   1. User selects a string/substring in a document to match.
   System Responsibility:
   2. Queries the classifier to retrieve the n-length substring matches.
   3. Displays to the user the frequency for the string/substring selected.

Use Case Name: View Statistics on Matches
Primary Actor: User
Preconditions: Document successfully loaded and scored, and output exists in the database
Postconditions: Information is displayed in the GUI
Main Success Scenarios:
   Actor Action (or Intention):
   1. User selects to view output.
   System Responsibility:
   2. System queries and retrieves the relevant data from the database.
   3. Displays the output in table form in the GUI.

Use Case Name: Visualise Representation of Classifier (View Suffix Tree)
Primary Actor: User
Preconditions: Classifier was successfully built
Postconditions: Classifier visual representation is displayed in the GUI
Main Success Scenarios:
   Actor Action (or Intention):
   1. User selects the option to display the suffix tree.
   System Responsibility:
   2. Builds a visual representation of the classifier and displays it in the GUI.
Use Case Name: Delete Classifier
Primary Actor: User
Preconditions: Classifier was successfully built
Postconditions: Classifier is deleted
Main Success Scenarios:
   Actor Action (or Intention):
   1. User selects the classifier to delete.
   System Responsibility:
   2. Removes the classifier and clears the displayed tree in the GUI.

7.2 Non-Functional Requirements

The non-functional requirements for the use cases are as follows.

7.2.1 Usability

The user should have one main, single user interface for interacting with the system. The user interface should be user friendly, and the complexity of computation (e.g. building an N-fold cross-validation set, or scoring documents against a classification model) should be hidden from the user.

An experimental run of the suffix tree classifier could involve as many as 126 scoring configurations, which could together take considerable time to calculate. It therefore makes sense to keep a store of all calculated scores, rather than calculate them on the fly whenever they are requested. The results will be cached in a data store, implemented as a database in this project, thus optimising system responsiveness.

Some system requests can only be activated once a precondition has been satisfied; e.g. the user can only score documents when the suffix tree has been created. The system should give informative warning messages if the user attempts to perform a task without the preconditions being satisfied. Where appropriate, upon a task being requested, the system may automatically carry out the preconditions before performing the requested task.

7.2.2 Hardware and Software Constraint

The application should be easily extensible and scalable. Developers should be able to add extra functionality and expand the workload the application can handle with relative ease. The design should consider future enhancement of the system and should be reasonably easy to maintain and upgrade. Code should also be well documented.

The system should use an RDBMS to manage its data layer, but be independent of the particular RDBMS used.
7.2.3 Documentation

Help menus and tool tips will be available to help users interact with the system. The application will also come with a user manual, including screen shots, and with written documentation for its installation and configuration.

7.3 System Framework

It was decided to build the system from a number of components, each with a specialised function in the system. Figure 6 illustrates the main components and the system boundary. The next section describes the functions of each component in more detail, and section 7.5 contains the class diagram. By isolating system responsibilities, the following main components were identified:

• User Interface
• Display Manager
• Classifier (Central Manager, STClassifier Manager, STClassifier)
• Sampling Set Generator
• Pre-processor
• Cross-validation
• Results Manager (Database Manager, OLEDB, Database)

Figure 7 shows how the system is divided into a client/server architecture. The advantage of this set-up is its ease of maintenance, as the server implementation can be an abstraction to the client. All the functionalities of the system are accessed through the graphical user interface (GUI). The implementation is in the server, isolating users from the system complexities not relevant to them.

One of the main aims of the design of the system was to create a flexible framework. The green boxes seen in Figure 8 represent new or alternative components that can be added to the system in the future with relative ease.
[Figure 6. System components and boundary: the Graphical User Interface sits at the system boundary and communicates, via the Display Manager, with the Central Manager, which connects to the Sampling Set Generator, Pre-processor, Cross-Validation, STClassifier Manager/STClassifier, and the Results Manager (Database Manager, OLEDB, Database), supported by the Utility classes.]

[Figure 7. Client/server division: the Graphical User Interface forms the client; the Display Manager and all other components form the server.]
[Figure 8. Additional or alternative components: green boxes (e.g. alternative interfaces, sampling methods, pre-processing methods, classifier models, and data stores) mark points where components can be added or swapped with relative ease.]

7.4 Components in Detail

7.4.1 The Client - User Interface

The user interacts with the system via a single graphical user interface, which is also the client. In this project the client is implemented as a set of Windows forms and controls in .NET. There is one main form from which users can access all the functionalities of the system, and a number of other dialog boxes and forms to help with navigation and interaction. For example, there is a Select Scoring Method form, used to request from the user the scoring methodology to use when scoring a new document. Other more generic forms, such as the Select Dialog form, are employed for a number of uses and do not display specific types of information (see section 10, Implementation Specifics, for further discussion).

The client is simply an event handler for each of the GUI controls that calls the Central Manager via the Display Manager for actual data processing. The GUI contains no implementation, but delegates to the Display Manager, thus decoupling the interface from the implementation. There is two-way communication between the client and the Display Manager, whereby a user invokes an event and related messages are passed on. The Display Manager passes the messages to the Central Manager, which subsequently either delegates the task to other, more specialised controllers, or resolves the request itself (a sketch of this pattern follows below).

The design of the screens was done in discussion with potential users. The user should be able to perform all the tasks described by the use cases in the Functional Requirements section (the functions will not be reiterated here).
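As an illustration of the thin-client pattern described above, here is a hypothetical event handler that only forwards the user's request. The form, control, and method names are assumptions made for this sketch, not the system's actual classes.

using System;
using System.Windows.Forms;

// Hypothetical Display Manager stub; in the real system this class
// forwards requests onward to the Central Manager.
public class DisplayManager
{
    public void LoadCorpus(string path)
    {
        // Delegate to the Central Manager for actual processing (omitted).
    }
}

public class MainForm : Form
{
    private DisplayManager displayManager = new DisplayManager();

    // Event handler for a 'Load Corpus' menu item (wiring of the handler
    // to the control is omitted). It contains no processing logic of its
    // own; it simply forwards the user's choice to the Display Manager.
    private void loadCorpusMenuItem_Click(object sender, EventArgs e)
    {
        FolderBrowserDialog dialog = new FolderBrowserDialog();
        if (dialog.ShowDialog() == DialogResult.OK)
        {
            displayManager.LoadCorpus(dialog.SelectedPath);
        }
    }
}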
For this project Windows forms were chosen for the implementation because most users are familiar with the Windows form interface. This creates a familiar experience on initial interaction with the system and facilitates its use. In particular, the .NET framework provides a wealth of controls and functionalities, which help to build a user-friendly interface and hide the complexity of the underlying workings from the user. The different components are built as separate classes, so the user interface (the client) could be implemented using a different methodology from Windows forms, such as a command line, as illustrated in Figure 9.

[Figure 9. Client interface and its collaborating components: the Graphical User Interface (with its Select Dialog and Select Scoring Method forms) or an alternative command line client passes input data to the Display Manager.]

7.4.2 Display Manager

The Display Manager is a layer between the User Interface on one side and the Central Manager and the rest of the system on the other. It essentially passes messages between these two components. The Display Manager is responsible for the information displayed back to the user, and it also manages the input data.

7.4.3 The Classifier

It was mentioned in the previous section that the Central Manager is part of the classifier. Figure 10 illustrates the classifier, which is enclosed by the red box in the figure, and its
connecting components. The classifier comprises the Central Manager, a controller that manages the underlying model of the classifier, and the underlying model itself. The Central Manager is a controller that handles the communication between all the main components in the system that communicate with the classifier. The Central Manager should provide the following functionalities:

• Select a sampling set for a corpus
• Pre-process all documents in a corpus
• Run cross-validation on a corpus
• Create a classifier for a given corpus
• Score all documents in a corpus
• Classify all documents in a corpus
• Obtain classification results for a corpus

There are further controller classes called by the Central Manager to provide more specialised functionalities: the Output Manager, Suffix Tree Manager, Sampling Set Generator, Pre-processor, and Cross-validation. When a user loads a corpus into the system it is managed by the Central Manager. If there is a request to create a sampling set, for example, the Central Manager knows where the corpus is located and delegates to the Sampling Set Generator the task of creating a sampling set based on the parameters set by the user. Similarly, a request from the user to perform pre-processing on the corpus is delegated by the Central Manager to the Pre-processor. The various components are designed to have specialised tasks; they do not need to know where the data is located, as this information is passed to them when the Central Manager invokes a request. The Sampling Set Generator does not need to know how the Pre-processor carries out its task, nor does it need to know about the Cross-validation component. The three components receive data and requests from the Central Manager, perform their tasks, and return any information back to the Central Manager.

The classifier has to be connected to an internal model. In this project the suffix tree data structure is employed to model the representation of document characteristics. As seen in Figure 10, the classifier can be implemented with different types of models, such as a Naïve Bayesian or a neural network. There is two-way communication between the Central Manager and the STClassifier via the STClassifier Manager. The STClassifier is a DLL library built by Birkbeck research. It provides public interfaces to:

• Build the representation of documents using the suffix tree data structure
• Train the classifier
• Score a document
• Return classification results
The STClassifier Manager controls the flow of messages between the Central Manager and the STClassifier. Its responsibilities involve converting data to the format accepted by the STClassifier, and converting the output that the STClassifier passes back. It is essentially a wrapper class for the STClassifier.

The suffix tree is built using the contents of the documents in a training set. Once a suffix tree is built it is cached in an ArrayList managed by the STClassifier Manager (an ArrayList is a C# collection class implemented in .NET). The suffix tree remains stored in memory until the user activates an event to delete it. As a result, the system does not need to re-create a suffix tree for every subsequent action that references it: only methods in the STClassifier Manager are called, and it is not necessary to call methods in the STClassifier.

The classifier generates output data when a request is invoked to classify and score documents. These two actions can be time consuming. The Central Manager decides what type of output data needs to be saved and passes the data from the classifier to the Results Manager to handle. Section 7.4.6 describes the design of the Results Manager.

[Figure 10. The classifier and its collaborating components: the Central Manager communicates with the STClassifier via the STClassifier Manager, with placeholders for alternative models (e.g. an NBClassifier or NNClassifier, each with its own manager).]
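The caching behaviour described above might look like the sketch below. It is illustrative only: the report states that trees are cached in an ArrayList, whereas this sketch uses a Hashtable keyed by corpus name for brevity, and the call into the Birkbeck DLL is stubbed because the STClassifier's real interface is not reproduced in this report.

using System.Collections;

public class STClassifierManagerSketch
{
    private Hashtable trees = new Hashtable(); // corpus name -> suffix tree

    // Return the cached tree for a corpus, building it on first request.
    public object GetTree(string corpusName, ArrayList trainingDocuments)
    {
        object tree = trees[corpusName];
        if (tree == null)
        {
            tree = BuildTree(trainingDocuments); // built once, then reused
            trees[corpusName] = tree;
        }
        return tree;
    }

    // The tree stays in memory until the user explicitly deletes it.
    public void DeleteTree(string corpusName)
    {
        trees.Remove(corpusName);
    }

    private object BuildTree(ArrayList trainingDocuments)
    {
        // In the real system this converts the documents to the format
        // the STClassifier expects and invokes the DLL; stubbed here.
        return new object();
    }
}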
7.4.4 Data Manipulation and Cleansing

When a corpus is loaded into the system as input data, the user can create sampling sets from the initial corpus and also prepare the data for experimentation by performing various types of pre-processing on it. The input data is given to the classifier, which sends it to the Sampling Set Generator to handle the generation of sampling sets. Various sampling methodologies can be plugged into the Sampling Set Generator; for this project the system implements random sampling and systematic sampling.

The Pre-processor provides the functionality for pre-processing data passed to it. Similarly, various methods of pre-processing can be plugged into the system with relative ease. Currently, the system provides stemming, stop word removal, and punctuation removal. In order for a method to plug into the system, the method class must implement an IMethod interface, which guarantees the following (see the sketch below):

• A method class must have a name property to return the name of the method. This is necessary so that if new methods are added to the system they can be identified by name.
• A method class must have a Run method. This method is where all the work is done.

A set of utility classes provides helper functionality such as a random number generator, a common divisor, and file system access.

[Figure 11. Data manipulation and cleansing components and their collaborators: the Central Manager calls the Sampling Set Generator (random, systematic, and other methods such as snowball sampling) and the Pre-processor (stemmer, stop word removal, punctuation removal, and others), both supported by the Utility classes.]
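The report specifies only that a plug-in class must expose a name property and a Run method; their exact signatures are not given, so the ones below (taking a document's text in and returning the processed text) are assumptions made for this sketch. Stop word removal is shown as an example plug-in.

using System.Collections;

public interface IMethod
{
    string Name { get; }        // identifies the method to the system
    string Run(string text);    // where all the work is done
}

// Example plug-in: stop word removal (the stop word list is illustrative).
public class StopWordRemoval : IMethod
{
    private static readonly string[] stopWords = { "the", "a", "an", "and", "of", "to", "in" };

    public string Name
    {
        get { return "Stop Word Removal"; }
    }

    public string Run(string text)
    {
        ArrayList kept = new ArrayList();
        foreach (string word in text.Split(' '))
        {
            bool isStopWord = false;
            foreach (string s in stopWords)
            {
                if (string.Compare(word, s, true) == 0) // case-insensitive
                {
                    isStopWord = true;
                    break;
                }
            }
            if (!isStopWord)
            {
                kept.Add(word);
            }
        }
        return string.Join(" ", (string[])kept.ToArray(typeof(string)));
    }
}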
7.4.5 Experimentation

Setting up data for experimentation is the main responsibility of the Cross-Validation class. The Central Manager passes a corpus to the Cross-Validation component, which uses the data to build N-fold cross-validation sets. It divides the given corpus into N blocks and builds a training set and a test set for each of the N runs. The data is stored as an array that is passed back to the Central Manager. The methods the Cross-Validation class is expected to perform are:

• Set the number of folds, N
• Run N-fold cross-validation on given source data
• Return the cross-validation sets in an array data structure

[Figure 12. Cross-validation and its collaborating components: the Central Manager communicates with the Cross-Validation class.]

7.4.6 Results Manager

The Results Manager handles the output of the classifier and the repository for that output. The underlying RDBMS of this project is an Access database, which is used to cache the data generated by the classifier. The OLEDB component is responsible for the direct communication with the database. This class needs to provide the basic database functionalities, such as read/write/delete, in a generic fashion (a sketch of such a layer follows below). It is through the Database Manager object that all communication with the OLEDB library occurs, and through which the data flows to and from the Results Manager; the Database Manager manages the OLEDB component.

The green boxes in Figure 13 illustrate that the information store for the system does not necessarily have to be an Access database. The system is designed to be able to store the data by different means with relative ease, e.g. XML files, SQL Server, etc.

[Figure 13. Results Manager and its collaborating components: the Central Manager talks to the Results Manager, which delegates to the Database Manager/OLEDB/Database chain, with an alternative XML File Manager/XML file path shown in green.]
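A minimal sketch of such a generic data access layer over the Access database, using the standard System.Data.OleDb classes; the class name, connection string, and method shapes are illustrative assumptions, not the project's actual Database Manager or OLEDB component.

using System.Data;
using System.Data.OleDb;

public class OleDbStore
{
    private readonly string connString;

    public OleDbStore(string databasePath)
    {
        // Jet provider for an Access (.mdb) database.
        connString = "Provider=Microsoft.Jet.OLEDB.4.0;Data Source=" + databasePath;
    }

    // Generic write: execute an INSERT, UPDATE, or DELETE statement and
    // return the number of rows affected.
    public int Execute(string sql)
    {
        using (OleDbConnection conn = new OleDbConnection(connString))
        {
            conn.Open();
            OleDbCommand cmd = new OleDbCommand(sql, conn);
            return cmd.ExecuteNonQuery();
        }
    }

    // Generic read: run a SELECT and return the result as a disconnected
    // DataTable that the caller can display or analyse.
    public DataTable Query(string sql)
    {
        DataTable table = new DataTable();
        using (OleDbConnection conn = new OleDbConnection(connString))
        {
            OleDbDataAdapter adapter = new OleDbDataAdapter(sql, conn);
            adapter.Fill(table); // Fill opens and closes the connection itself
        }
        return table;
    }
}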
7.4.7 Error Handling

Adequate error handling is essential for an end user application. Warnings and errors should be handled in the higher levels of the system, namely by the Display Manager, and then displayed to the user in a reasonable fashion. Errors that occur in the other classes should be propagated to the Display Manager. All classes apart from the User Interface and the Display Manager are expected to implement an IErrorRecord interface. A class that implements this interface guarantees that it has an error property which returns the error message.
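The report names only the error property of the IErrorRecord interface, so everything else in the sketch below is an assumption made for illustration: a component records its failure instead of displaying it, and the Display Manager reads the property and presents the message to the user.

public interface IErrorRecord
{
    string Error { get; } // the last error message, empty if none
}

// Example: a component that implements the contract (hypothetical class).
public class PreprocessorSketch : IErrorRecord
{
    private string error = "";

    public string Error
    {
        get { return error; }
    }

    public bool Run(string path)
    {
        try
        {
            // ... perform the component's actual work here ...
            return true;
        }
        catch (System.Exception ex)
        {
            error = ex.Message; // propagated up to the Display Manager
            return false;
        }
    }
}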
