A Management and Visualisation Tool for Text Mining Applications

Student: Peishan Mao
MSc Computing Science Project Report
School of Computing Science and Information Systems
Birkbeck College, University of London, 2005
Status: Draft. Last saved 26 Apr. 10.
1 of 93
1 TABLE OF CONTENTS

1 TABLE OF CONTENTS ... 2
2 ACKNOWLEDGEMENT ... 5
3 ABSTRACT ... 6
4 INTRODUCTION ... 7
5 BACKGROUND ... 8
  5.1 Written Text ... 8
  5.2 Natural Language Text Classification ... 8
    5.2.1 Text Classification ... 8
    5.2.2 The Classifier ... 9
  5.3 Text Classifier Experimentations ... 12
6 HIGH-LEVEL APPLICATION DESCRIPTION ... 14
  6.1 Description and Rationale ... 14
    6.1.1 Build a Classifier ... 14
    6.1.2 Evaluate and Refine the Classifier ... 15
  6.2 Development and Technologies ... 15
7 DESIGN ... 17
  7.1 Functional Requirements ... 17
  7.2 Non-Functional Requirements ... 22
    7.2.1 Usability ... 22
    7.2.2 Hardware and Software Constraints ... 22
    7.2.3 Documentation ... 23
  7.3 System Framework ... 23
  7.4 Components in Detail ... 25
    7.4.1 The Client - User Interface ... 25
    7.4.2 Display Manager ... 26
    7.4.3 The Classifier ... 26
    7.4.4 Data Manipulation and Cleansing ... 28
    7.4.5 Experimentation ... 29
    7.4.6 Results Manager ... 30
    7.4.7 Error Handling ... 31
  7.5 Class Diagram ... 32
8 DATABASE ... 33
  8.1 Entities ... 33
    8.1.1 Score Table ... 33
    8.1.2 Source Table ... 33
    8.1.3 Configuration Table ... 33
    8.1.4 Score Functions Table ... 33
    8.1.5 Match Normalisation Functions Table ... 34
    8.1.6 Tree Normalisation Functions Table ... 34
    8.1.7 Classification Condition Table ... 34
    8.1.8 Class Weights Table ... 34
    8.1.9 Temporary Max and Min Score Table ... 34
  8.2 Views ... 35
    8.2.1 Weighted Scores ... 35
    8.2.2 Maximum and Minimum Scores ... 35
    8.2.3 Misclassified Documents ... 35
  8.3 Relation Design for the Main Tables ... 35
9 IMPLEMENTATION ... 37
  9.1 Main User Interface ... 37
  9.2 Display Manager ... 39
  9.3 Classifier Classes ... 40
  9.4 Results Output Classes ... 41
  9.5 Other Controller Classes ... 43
  9.6 TreeView Controller Class ... 44
  9.7 Error Interface ... 45
10 IMPLEMENTATION SPECIFICS ... 46
  10.1 Generic Selection Form Class ... 46
  10.2 Visualisation of the Suffix Tree ... 48
  10.3 Dynamic Sub-String Matching ... 49
  10.4 User Interaction Warnings ... 50
11 USER GUIDE ... 53
  11.1 Getting Started ... 53
    11.1.1 Input Data ... 53
  11.2 Loading a Resource Corpus ... 54
  11.3 Selecting a Sampling Set ... 57
  11.4 Performing Pre-processing ... 61
  11.5 Running N-Fold Cross-Validation ... 64
    11.5.1 Set Up Cross-Validation Set ... 64
    11.5.2 Perform Experiments on the Data ... 67
      11.5.2.1 Create the Suffix Tree ... 67
      11.5.2.2 Display Suffix Tree ... 69
      11.5.2.3 Delete Suffix Tree ... 71
      11.5.2.4 N-Gram Matching ... 71
      11.5.2.5 Score Documents ... 73
      11.5.2.6 Classify Documents ... 74
      11.5.2.7 Add New Document to Classify ... 76
  11.6 Creating a Classifier ... 79
12 TESTING ... 81
13 CONCLUSION ... 83
  13.1 Evaluation ... 83
  13.2 Future Work ... 84
14 BIBLIOGRAPHY ... 86
15 APPENDIX A DATABASE ... 88
16 APPENDIX B CLASS DEFINITIONS ... 90
17 APPENDIX C SOURCE CODE ... 93
2 ACKNOWLEDGEMENT

I would like to thank the following people for their help over the course of this project:

Rajesh Pampapathi: for his spectrum of help on the project, ranging from his patience and advice across the whole area of text classification, and pointing me in the right direction for information on the topic, to being interviewed as a potential user of the proposed system as part of the requirements collection.

Timothy Yip: for laboriously proofreading the draft of the report despite not having much interest in information technology.
3 ABSTRACT

This report describes the design and implementation of a management and visualisation tool for text classification applications. The system is built as a wrapper around a machine learning classification tool, and aims to provide a flexible framework that can accommodate future changes to the system. The system is implemented in C# .NET, with a Windows Forms front end and an Access database as an example back end, but is designed to be flexible enough for different underlying components to be added.
4 INTRODUCTION

This report describes the project carried out to implement a management and visualisation tool for text classification. It covers background information about the project, the design, the implementation, and conclusions. The report is organised as follows:

Section 4 is this section; it describes the organisation of the report.
Section 5 looks at the background of the project. It covers natural language classification and the suffix tree data structure used in Pampapathi et al's study.
Section 6 gives a high-level description and rationale of the system.
Section 7 describes the design of the system. It lays out the system requirements and the system framework, and describes the system components and classes.
Section 8 explains the database design and describes the database entities and table relations.
Section 9 discusses how the system was implemented and goes into the class definitions.
Section 10 focuses on specific implementation issues: the generic selection form class, visualisation of the suffix tree, dynamic sub-string matching on documents, and user warnings.
Section 11 is the user guide to the system.
Section 12 describes testing.
Section 13 concludes the project. It discusses whether the system built has met the requirements laid out at the beginning of the project, and looks at future work.
Appendix A: Database
Appendix B: Class Definitions
5 BACKGROUND

5.1 Written Text

Writing has long been an important means of exchanging information, ideas and concepts between individuals and groups; indeed, it has even been described as the single most advantageous evolutionary adaptation for species preservation [2]. Written text contains a vast amount of information, and the advent of the internet and on-line documents has contributed to the proliferation of digital textual data readily available for our perusal. Consequently, it is increasingly important to have a systematic method of organising this corpus of information, and tools for textual data mining are proving ever more important for our growing mass of text-based data. Computing science has contributed significantly to this area by automating the data mining process.

Encoding unstructured text into a more structured form is not a straightforward task: natural language is rich and ambiguous, and working with free text is one of the most challenging areas in computer science. This project investigates how computer science can help to evaluate some of the vast amount of textual information available to us, and how to provide a convenient way to access this type of unstructured data. In particular, the focus is on the classification aspect of data mining, which the next section explores in more depth.

5.2 Natural Language Text Classification

5.2.1 Text Classification

Sebastiani [3] described automated text categorisation as "the task of automatically sorting a set of documents into categories (or classes, or topics) from a predefined set. The task, that falls at the crossroads of information retrieval, machine learning, and (statistical) natural language processing, has witnessed a booming interest in the last ten years from researchers and developers alike." Classification maps data into predefined groups or classes.
Examples of classification applications include image and pattern recognition, medical diagnosis, loan approval, fault detection in industrial applications, and classifying financial trends. Until the late 1980s, knowledge engineering was the dominant paradigm in automated text categorisation. Knowledge engineering consists of the manual definition, by domain experts, of a set of rules which form part of a classifier. Although this approach has produced accuracies as high as 90% [3], it is labour intensive and domain specific. A new paradigm based on machine learning, which addresses many of the limitations of knowledge engineering, has since superseded it. Machine learning encompasses a variety of methods that represent the convergence of statistics, biological modelling, adaptive control theory, psychology, and artificial
intelligence (AI) [11]. Data classification by machine learning is a two-phase process (see Figure 1). The first phase involves a general inductive process that automatically builds a model, using a classification algorithm, describing a predetermined set of non-overlapping data classes. This step is referred to as supervised learning, because the classes are determined before examining the data; the set of data used is known as the training set. Data in text classification comes in the form of files, and each file is usually referred to as a document. Classification algorithms require that the classes are defined based purely on the content of the documents: they describe each class by looking at the characteristics of the documents in the training set already known to belong to it. The learned model constitutes the classifier and can be used to categorise future corpus samples. In the second phase, the classifier constructed in phase one is used for classification.

The machine learning approach to text classification is less labour intensive and is domain independent. Since the attribution of documents to categories is based purely on the content of the documents, effort is concentrated on constructing an automatic builder of classifiers (also known as the learner), and not the classifier itself [3]. The learner is a tool that extracts the characteristics of the training set into a classification model. This means that once a learner is built, new classifiers can be automatically constructed from sets of manually classified documents.

[Figure 1. a) Step one in text classification. b) Step two in text classification. (Diagram not reproduced: the training set feeds a classification algorithm, which derives the classification model; the model is then applied to the test set and to new documents.)]

5.2.2 The Classifier

In general a text classifier comprises a number of basic components. As noted in the previous section, the text classifier begins with an inductive stage.
A classifier requires some form of internal text representation of documents. In order to build an internal model, the inductive step uses a set of examples for training the classifier. This set of examples is known as the training set, and each document in it is assigned to a class from C = {c1, c2, ... cn}. All documents used in the training phase are transformed into the internal representation.

A dominant learning method in text classification is currently based on the vector space model [5]. The Naïve Bayesian classifier is one example, and is often used as a benchmark in text classification experiments. Bayesian classifiers are statistical classifiers: classification is based on the probability that a given document belongs to a particular class. The
approach is 'naïve' because it assumes that the contributions of all attributes to a given class are independent, and that each contributes equally to the classification problem. By analysing the contribution of each 'independent' attribute, a conditional probability is determined. Attributes in this approach are the words that appear in the documents of the training set. Documents are represented by a vector with as many dimensions as there are distinct words in the documents of the training set; the value of each entry in the vector is the frequency of the corresponding word. Under this approach, the training data are used to estimate the parameters of a probability distribution, and Bayes' theorem is used to estimate the probability of each class; a new document is assigned to the class that yields the highest probability. It is important to perform pre-processing to remove frequent words, such as stop words, before a training set is used in the inductive phase.

The Naïve Bayesian approach has several advantages. Firstly, it is easy to use; secondly, only one scan of the training data is required. It can also easily handle missing values, by simply omitting the corresponding probability when calculating the likelihood of membership of each class. Although the Naïve Bayesian classifier is popular, it represents documents as a 'bag of words' in which the words of a document have no relationships with each other; yet words that appear in a document are usually not independent. Furthermore, the smallest unit of representation is a word. Research continues to investigate how the design of text classifiers can be improved, and Pampapathi et al [1] at Birkbeck College, London recently proposed an innovative new approach to the internal modelling of text classifiers.
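The word-frequency scoring just described can be sketched briefly. The system in this report is implemented in C#; the sketch below uses Python purely for compactness, and every name in it is hypothetical. It also adds one detail the text above does not discuss, add-one smoothing for unseen words, which is a common assumption rather than something the report specifies.

```python
import math
from collections import Counter, defaultdict

def train(labelled_docs):
    """Estimate per-class word frequencies from (class_label, text)
    pairs -- the inductive phase."""
    word_counts = defaultdict(Counter)   # class -> word frequencies
    class_counts = Counter()             # class -> number of documents
    for label, text in labelled_docs:
        class_counts[label] += 1
        word_counts[label].update(text.lower().split())
    return word_counts, class_counts

def classify(text, word_counts, class_counts):
    """Assign the class with the highest posterior probability, using
    log-probabilities and add-one smoothing for unseen words."""
    vocab = {w for counts in word_counts.values() for w in counts}
    total_docs = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label in class_counts:
        total_words = sum(word_counts[label].values())
        score = math.log(class_counts[label] / total_docs)   # class prior
        for word in text.lower().split():
            freq = word_counts[label][word]
            score += math.log((freq + 1) / (total_words + len(vocab) + 1))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

train_set = [("spam", "buy cheap pills now"),
             ("spam", "cheap offer buy now"),
             ("ham",  "meeting agenda for monday"),
             ("ham",  "monday project meeting notes")]
wc, cc = train(train_set)
print(classify("cheap pills offer", wc, cc))       # -> spam
print(classify("project meeting monday", wc, cc))  # -> ham
```

In keeping with the two-phase process above, `train` is the learner and the returned counts constitute the model; `classify` applies it to new documents.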
Pampapathi et al used a well-known data structure called a suffix tree [11], which allows the characteristics of documents to be indexed at a more granular level, with documents represented by their substrings. The suffix tree is a compact trie containing all the suffixes of the strings it represents. A trie is a tree structure in which each node represents one character and the root represents the null string. Each path from the root represents a string, described by the characters labelling the nodes traversed, and all strings sharing a common prefix branch off from a common node. When the strings are words over the letters a to z, a node has at most 26 children, one for each letter (or 27, including a terminator). Suffix trees have traditionally been used for complex string matching problems over string sequences (data compression, DNA sequencing); Pampapathi et al's research is the first to apply them to natural language text classification.

Pampapathi et al's method of constructing the suffix tree varies slightly from the standard one. Firstly, the tree nodes are labelled instead of the edges, in order to associate the frequencies directly with the characters and substrings. Secondly, no special terminal character is used, as the focus is on substrings rather than suffixes. Each suffix tree has a depth, described by the maximum number of levels in the tree; a level is defined by the number of nodes away from the root node. For example, the suffix tree illustrated in Figure 2 has a depth of 4. Pampapathi et al set a limit on the tree depth, and each node of the suffix tree stores a character and its frequency. For example, constructing a suffix tree for the string S1 = "COOL" creates the tree shown in Figure 2. The substrings are COOL; OOL; OL; and L.
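The construction just described can be sketched as follows, again in Python for compactness (the project itself is C#, and the names here are hypothetical). The depth limit and per-node frequencies follow the description above; `match_score` is a deliberately naive stand-in, since the project's actual score, normalisation, and weighting functions are treated separately (see the database entities in Section 8).

```python
class Node:
    def __init__(self):
        self.children = {}   # character -> Node (nodes, not edges, are labelled)
        self.freq = 0        # how many times this character/substring was seen

def insert(root, text, max_depth=4):
    """Insert every suffix of `text`, truncated at max_depth, so that
    the tree holds all substrings up to the depth limit."""
    for start in range(len(text)):
        node = root
        for ch in text[start:start + max_depth]:
            node = node.children.setdefault(ch, Node())
            node.freq += 1

def match_score(root, text, max_depth=4):
    """Naive score: total frequency mass of the document's substrings
    found in the tree (one simple choice of score function)."""
    score = 0
    for start in range(len(text)):
        node = root
        for ch in text[start:start + max_depth]:
            if ch not in node.children:
                break
            node = node.children[ch]
            score += node.freq
    return score

# One tree per class, trained on that class's documents; a new document
# is assigned the class of the highest-scoring tree.
trees = {}
for label, docs in {"c1": ["COOL", "COLD"], "c2": ["FOOL", "FOLD"]}.items():
    trees[label] = Node()
    for d in docs:
        insert(trees[label], d)
print(max(trees, key=lambda c: match_score(trees[c], "FOOD")))   # -> c2
```

Note that inserting "FOOL" after "COOL" increments the frequencies on the shared OOL, OL and L paths rather than creating new nodes, exactly as the text describes.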
[Figure 2. Suffix tree for the string 'COOL'. (Diagram not reproduced: each node is labelled with a character and its frequency, e.g. C (1), O (1), L (1).)]

If a second string S2 = "FOOL" is inserted into the suffix tree, it will look like the diagram illustrated in Figure 3. The substrings of S2 are FOOL; OOL; OL; and L. Notice that the last three substrings of S2 duplicate substrings already seen in S1, so no new nodes are created for them; instead the frequencies on the existing nodes are incremented.

[Figure 3. Suffix tree with the string 'FOOL' added. (Diagram not reproduced: the nodes shared between S1 and S2 now carry frequency 2.)]

Similar to the Naïve Bayesian method, a classifier using a suffix tree as its internal model undergoes supervised learning from a training set containing documents pre-classified into classes. Unlike the Naïve Bayesian approach, the suffix tree, by capturing the characteristics of documents at the character level, does not require pre-processing of the training set. A suffix tree is built for each class, and a new document is classified by scoring it against each of the trees; the class of the highest-scoring tree is assigned to the document. Pampapathi et al's study was based on email
classification, and the results of the experiments showed that a classifier employing a suffix tree outperformed the Naïve Bayesian method.

To solve a classification problem, the classifier is one of the central components, but, as seen with the Naïve Bayesian method, it is also important to perform pre-processing on the data used for training. The next section looks at the processes involved in text classification other than the classifier component itself.

5.3 Text Classifier Experimentations

As described in the previous sections, classification is a two-step process:

1. Create a specific model by evaluating the training data. This step takes the training data (including the category/class labels) as input, and produces a definition of the model as output. The model created, which is the classifier, should classify the training data as accurately as possible.

2. Apply the model developed by classifying new sets of documents.

In the research community, or for anyone interested in evaluating the performance of a classifier, the second step can be more involved. First, the predictive accuracy of the classifier is estimated. A simple yet popular technique, called the holdout method, uses a test set of class-labelled samples. These samples are usually randomly selected, and it is important that they are independent of the training samples; otherwise the estimate could be optimistic, since the learned model is based on that data and will therefore tend to overfit. The accuracy of a classifier on a given test set is the percentage of test set samples that it classifies correctly: for each test sample, the known class label is compared with the classifier's class prediction. If the accuracy of the model is considered acceptable, the model can be used to classify new documents.
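The holdout estimate just described amounts to a few lines of code. The sketch below is Python for compactness (the project itself is C#); the function names are hypothetical, and the training and classification routines are taken as parameters so that any classifier can be plugged in.

```python
import random

def holdout_accuracy(corpus, train_fn, classify_fn, test_fraction=0.3, seed=0):
    """Estimate classifier accuracy with the holdout method: hold out a
    random, independent test set and report the fraction of its samples
    classified correctly. `corpus` is a list of (class_label, text) pairs."""
    docs = corpus[:]
    random.Random(seed).shuffle(docs)
    cut = int(len(docs) * test_fraction)
    test_set, training_set = docs[:cut], docs[cut:]
    model = train_fn(training_set)
    correct = sum(1 for label, text in test_set
                  if classify_fn(text, model) == label)
    return correct / len(test_set)

# Toy check with a rule-based stand-in for a real learned classifier:
corpus = [("spam", "buy now"), ("spam", "cheap buy"), ("spam", "buy pills"),
          ("ham", "meeting notes"), ("ham", "agenda notes"), ("ham", "notes monday")]
train_fn = lambda training_set: None                        # the rule needs no training
classify_fn = lambda text, model: "spam" if "buy" in text else "ham"
print(holdout_accuracy(corpus, train_fn, classify_fn))      # -> 1.0
```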
[Figure 4. Estimating classifier accuracy with the holdout method. (Diagram not reproduced: the corpus data is split into a training set, used to derive the classifier, and a test set, used to estimate its accuracy.)]

The estimate obtained with the holdout method is pessimistic, since only a portion of the initial data is used to derive the classifier. Another technique, called N-fold cross-validation, is often used in research. Cross-validation is a statistical technique which can mitigate the bias caused by any particular partition into training and test sets, and it is also useful when the amount of data is limited. The method can be used to evaluate and estimate the performance of a classifier, and the aim is to obtain as honest an estimate as possible of the classification accuracy of the system. N-fold cross-validation involves
partitioning the dataset (the initial corpus) randomly into N equally sized, non-overlapping blocks or folds. The training-testing process is then run N times, each time with a different test set. For example, when N = 3, we have the following training and test sets:

  Run   Train   Test
  1     1, 2    3
  2     1, 3    2
  3     2, 3    1

Figure 5. 3-Fold Cross-Validation

For each cross-validation run the user is able to use a training set to build the classifier. Stratified N-fold cross-validation is a recommended method for estimating classifier accuracy, owing to its low bias and variance [13]. In stratified cross-validation, the folds are stratified so that the class distribution of the samples in each fold is approximately the same as that of the initial training set.

Preparing the training data for classification with pre-processing can help improve the accuracy, efficiency, and scalability of the evaluation. Methods include stop word removal, punctuation removal, and stemming. Using these techniques to prepare the data and estimate classifier accuracy increases the overall computational time, yet is useful for evaluating a classifier and for selecting among several classifiers.

The current project aims to build a system which is a wrapper around a text classifier and which incorporates, as an example, the suffix tree used in the research by Pampapathi et al. The next section and beyond describe the project in detail.
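The stratified partitioning described above can be sketched as follows (Python for compactness; names hypothetical). The sketch deals each class's documents out round-robin so every fold preserves the class distribution; a real implementation would also shuffle within each class first.

```python
from collections import defaultdict

def stratified_folds(corpus, n):
    """Split (class_label, document) pairs into n non-overlapping folds
    whose class distributions match that of the full corpus."""
    folds = [[] for _ in range(n)]
    by_class = defaultdict(list)
    for label, doc in corpus:
        by_class[label].append((label, doc))
    for label, docs in by_class.items():
        for i, item in enumerate(docs):      # deal each class out round-robin
            folds[i % n].append(item)
    return folds

def cross_validation_runs(folds):
    """Yield (training_set, test_set) once per fold: each fold is the
    test set exactly once, the rest form the training set."""
    for i, test_set in enumerate(folds):
        training_set = [d for j, f in enumerate(folds) if j != i for d in f]
        yield training_set, test_set

corpus = ([("spam", f"spam doc {i}") for i in range(6)] +
          [("ham", f"ham doc {i}") for i in range(6)])
folds = stratified_folds(corpus, 3)
for training_set, test_set in cross_validation_runs(folds):
    print(len(training_set), len(test_set))   # -> 8 4 on each of the 3 runs
```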
6 HIGH-LEVEL APPLICATION DESCRIPTION

6.1 Description and Rationale

The aim of this project is to build a management and visualisation tool that allows researchers to perform data manipulation in support of underlying text classification algorithms. The tool provides a software infrastructure for a data mining system based on machine learning. The goal is a flexible framework that allows the underlying components to be changed with relative ease: functions may be added to the system in the future, and adding new functionality should have minimal effect on the current system.

The system is built as a wrapper for the two-step classification process. First, a component is built that automatically constructs a classifier given some training data. Second, the system provides the capability to perform classification and to evaluate a classifier's performance. Additionally, the tool provides functionality for data sampling and various kinds of pre-processing.

For the researcher it is important to clearly define the training set (known in this report as the 'resource corpus') used for training the classifier. When the resource corpus is small, the user can choose to use the entire corpus in the study; if it is large, the tool gives the option of selecting sampling sets to represent it. A number of sampling methodologies are implemented that allow the user to select a sample reflecting the characteristics of the resource corpus from which it is drawn. Note that a resource corpus is grouped into classes, and this structure had to be taken into consideration when the sampling mechanism was developed. Three popular sampling methods are implemented, although others could be added, such as convenience sampling, judgement sampling, quota sampling, and snowball sampling.
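The report does not, at this point, name the three sampling methods implemented, so the sketch below (Python for compactness; all names hypothetical) illustrates three plausible candidates, of which the stratified one respects the class-grouped structure noted above.

```python
import random
from collections import defaultdict

def simple_random_sample(docs, k, seed=0):
    """Draw k documents uniformly at random, without replacement."""
    return random.Random(seed).sample(docs, k)

def systematic_sample(docs, k):
    """Take every (len/k)-th document from a fixed ordering."""
    step = len(docs) // k
    return [docs[i * step] for i in range(k)]

def stratified_sample(labelled_docs, fraction, seed=0):
    """Sample the same fraction from every class, so the sample
    preserves the class structure of the resource corpus."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for label, doc in labelled_docs:
        by_class[label].append((label, doc))
    sample = []
    for docs in by_class.values():
        k = max(1, round(len(docs) * fraction))
        sample.extend(rng.sample(docs, k))
    return sample

corpus = [("spam", f"s{i}") for i in range(10)] + [("ham", f"h{i}") for i in range(20)]
print(len(stratified_sample(corpus, 0.2)))   # 2 spam + 4 ham -> 6
```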
Note that the user can choose to evaluate the data used to construct the classifier before actually building it. The tool is designed to be generic enough to analyse a corpus of any categorisation type, e.g. automated indexing of scientific articles, email routing, spam filtering, criminal profiling, and expertise profiling.

6.1.1 Build a Classifier

The tool allows the user to build a classifier. The current framework implements only the suffix tree-based classifier developed by Birkbeck College, but is flexible enough to incorporate other classification models in the future. The research applying suffix trees to classification is new, and no such application currently exists. The learning process of the classifier follows the machine learning approach to automated text classification, whereby the system automatically builds a classifier for the categories of interest. From the graphical user interface (GUI), the user selects a corpus to use as training data. The application links to .dll files, developed by Birkbeck College, which allow the user to build a suffix tree from the selected corpus. The internal data representation is constructed by generalising from a training set of pre-classified documents. Once the classifier is built, the user can load new documents into the system to be classified.
  • 15. 6.1.2 Evaluate and Refine the Classifier In research, once a classifier has been built it is desirable to evaluate its effectiveness. Even before the construction of the classifier, the tool provides a platform for users to perform a number of experiments and refinements on the source (training) data. Hence, the second focus of the project is to provide a user-friendly front end and a base application for testing classification algorithms. The user can load in a text-based corpus and perform standard pre-processing functions to remove noise and prepare the data for experimentation. There is also a choice of sampling methods to reduce the size of the initial corpus, making it more manageable. Sebastiani [2] notes that any classifier is prone to classification error, whether the classifier is human or machine. This is due to a notion central to text classification: the membership of a document in a class, based on the characteristics of the document and the class, is inherently subjective, since the characteristics of both the document and the class cannot be formally specified. As a result, automatic text classifiers are evaluated using a set of pre-classified documents: the accuracy of a classifier is measured by comparing its classification decisions with the categories the documents were originally assigned to. For experimentation and evaluation purposes, this set of pre-classified documents is split into two sets, a training set and a test set, not necessarily of equal sizes. The tool implements an extra level of experimentation using n-fold cross-validation. Since the data is grouped by classes, cross-validation must take this grouping into account; this project therefore implements stratified cross-validation. Once a classifier has been constructed, it is possible to perform data classification experiments as well as other tasks such as single document analysis. 
For example, for the implementation of a suffix tree-based classifier the user will be able to view the structure of the suffix tree, as well as the documents in the test sets, or load a new document and obtain a full matrix of output data about it. The output data is persisted in an information system which is subsequently used to perform analysis and visualisation tasks. 6.2 Development and Technologies Development was done in C#, using the .NET framework. The architecture of the system was designed as an extensible platform to enable users and developers to leverage the existing framework for future system upgrades. The tool was built from several components and aims to be modular. A number of controller components provide the functionality of the tool. A set of libraries provides the functionality of the suffix tree; these libraries were supplied by Birkbeck College, and the interface to them was developed in close collaboration with Birkbeck College researchers. The suffix tree data structure is built in memory and can become very large. One solution to better utilise resources is to have the data structure physically stored as one tree, although it is logically represented as individual trees for each class. Further discussion can be found in subsequent sections. 15 of 93
  • 16. A Windows application was built as the client. This forms the interface through which the user gains access to the functionality of the tool. The output data is cached in a database. The main targeted users of the tool are researchers in natural language text classification, and other users who want to mine textual data. 16 of 93
  • 17. 7 DESIGN 7.1 Functional Requirements Requirements for the application were collected from research on natural language text classification and from discussions with targeted users in the research community. Requirements are the capabilities and conditions to which the application must conform. The functional requirements of the system are captured using 'use cases'. Use cases are a useful tool for describing how a user interacts with a system: written stories that describe the interaction between the system and the user in a form that is easy to understand. Requirements can often change over the course of development, and for this reason there was no attempt to define and freeze all requirements at the onset of the project. The following use cases were produced; note that some were added throughout the development of the system.

Use Case Name: Load Directory as Source Corpus
Primary Actor: User
Pre-conditions: The application is running
Post-conditions: A source corpus is loaded into the application
Main Success Scenarios:
Actor Action (or Intention): 1. The user selects a valid directory and has at least read access to the directory
System Responsibility: 2. The system checks for path validity and access, and loads the directory into the system as a corpus 3. Builds a tree structure of classes based on the sub-folders in the directory and displays the classes in the GUI

Use Case Name: View a Document in Corpus
Primary Actor: User
Pre-conditions: A corpus is successfully loaded
Post-conditions: -
Main Success Scenarios:
Actor Action (or Intention): 1. Select the document to view
System Responsibility: 2. Display content of document in the GUI

Use Case Name: Create Sampling Set 17 of 93
  • 18. Primary Actor: User
Pre-conditions: A source corpus is successfully loaded
Post-conditions: A sampling set based on the source corpus is created. A new file directory is created for the corpus.
Main Success Scenarios:
Actor Action (or Intention): 1. User selects how they want to select the sampling set 2. User specifies the location to store the documents/files created for the sampling set
System Responsibility: 3. Creates a sampling set based on parameters given by the user 4. Creates the directory structure and documents/files in the location specified by the user 5. Displays the new corpus in the GUI

Use Case Name: Run Pre-Processing
Primary Actor: User
Pre-conditions: A training set exists in the system
Post-conditions: A new pre-processed sampling set is created. A new file directory is created for the corpus.
Main Success Scenarios:
Actor Action (or Intention): 1. Select type of pre-processing to perform 2. User specifies the location to store the documents/files created for the pre-processing set 3. Run pre-processing
System Responsibility: 4. Performs pre-processing 5. Creates a new pre-processed set 6. Stores the directory structure and documents/files at the location specified by the user 7. Displays the corpus as a directory structure in the GUI

Use Case Name: Run N-Fold Cross-Validation
Primary Actor: User
Pre-conditions: A sampling set is successfully created
Post-conditions: An n-fold cross-validation set is created virtually
Main Success Scenarios:
Actor Action (or Intention): 1. User selects the sampling set to process and the number of folds
System Responsibility: 2. Builds an n-fold cross-validation set based on parameters given by the user, comprising n runs, 18 of 93
  • 19. each run containing a training set and a test set 3. Displays the new cross-validation set in the GUI

Use Case Name: Create Classifier (Suffix Tree)
Primary Actor: User
Pre-conditions: A cross-validation set or classification set exists
Post-conditions: Classifier created in memory
Main Success Scenarios:
Actor Action (or Intention): 1. User activates an event to build a classifier for a cross-validation set or classification set 2. User chooses any additional conditions to apply
System Responsibility: 3. Builds the classifier in memory, based on the corpus set selected 4. Indicates in the GUI that the classifier for the corpus has been created

Use Case Name: Score Documents
Primary Actor: User
Pre-conditions: An n-fold cross-validation set is created. The classifier for the corpus set is created
Post-conditions: Documents in the cross-validation set are scored and the data is stored in the database
Main Success Scenarios:
Actor Action (or Intention): 1. User selects the cross-validation run to score
System Responsibility: 2. Scores all documents under the selected corpus set 3. Inserts score data into the database

Use Case Name: Classify Documents
Primary Actor: User
Pre-conditions: An n-fold cross-validation set is created. The classifier for the set is created and the documents have been scored
Post-conditions: Misclassified documents in the cross-validation set are flagged
Main Success Scenarios: 19 of 93
  • 20. Actor Action (or Intention): 1. User selects the cross-validation run to classify
System Responsibility: 2. Classifies all documents under the selected cross-validation set 3. Flags all misclassified documents in the GUI

Use Case Name: Create Classification Set
Primary Actor: User
Pre-conditions: A source corpus is successfully loaded
Post-conditions: A classification set is created virtually
Main Success Scenarios:
Actor Action (or Intention): 1. User selects the corpus set they want to use to create a classifier
System Responsibility: 2. Displays the new corpus in the GUI as a classification corpus set

Use Case Name: Load New Document to Classify
Primary Actor: User
Pre-conditions: A cross-validation set or classification set exists
Post-conditions: Substring matches and related output data are stored in the database
Main Success Scenarios:
Actor Action (or Intention): 1. User decides which suffix tree to use for classification and loads in a valid textual document to be classified and analysed 3. Score and classify document
System Responsibility: 2. Document name and relevant information is displayed in the GUI as an item ready to be analysed 4. Stores output data in the database

Use Case Name: View a Document
Primary Actor: User
Pre-conditions: Document loaded into the system
Post-conditions: -
Main Success Scenarios:
Actor Action (or Intention): 1. Select the document to view
System Responsibility: 2. Display content of document in the GUI 20 of 93
  • 21. Use Case Name: View n-Gram Matches in Document
Primary Actor: User
Pre-conditions: The document in question is successfully loaded and a suffix tree classifier created
Post-conditions: -
Main Success Scenarios:
Actor Action (or Intention): 1. User selects a string/substring in a document to match
System Responsibility: 2. Queries the classifier to retrieve the n-length substring matches 3. Displays to the user the frequency for the string/substring selected

Use Case Name: View Statistics on Matches
Primary Actor: User
Pre-conditions: Document successfully loaded and scored, and output exists in the database
Post-conditions: Information displayed in the GUI
Main Success Scenarios:
Actor Action (or Intention): 1. User selects to view output
System Responsibility: 2. System queries and retrieves the relevant data in the database 3. Displays the output in table form in the GUI

Use Case Name: Visualise Representation of Classifier (View Suffix Tree)
Primary Actor: User
Pre-conditions: Classifier was successfully built
Post-conditions: Classifier visual representation displayed in the GUI
Main Success Scenarios:
Actor Action (or Intention): 1. User selects the option to display the suffix tree
System Responsibility: 2. Builds a visual representation of the classifier and displays it in the GUI 21 of 93
  • 22. Use Case Name: Delete Classifier
Primary Actor: User
Pre-conditions: Classifier was successfully built
Post-conditions: Classifier is deleted
Main Success Scenarios:
Actor Action (or Intention): 1. User selects the classifier to delete
System Responsibility: 2. Removes the classifier and clears the displayed tree in the GUI

7.2 Non-Functional Requirements The non-functional requirements for the use cases are as follows. 7.2.1 Usability The user should have one main, single user interface to interact with the system. The user interface should be user-friendly, and the complexity of computation, e.g. building an n-fold cross-validation set or scoring documents against a classification model, should be hidden from the user. An experimental run of the suffix tree classifier could involve as many as 126 scoring configurations, which together could take considerable time to calculate. It therefore makes sense to keep a store of all calculated scores, rather than calculate them on the fly whenever they are requested. The results will be cached in a data store, implemented as a database in this project, thereby optimising system responsiveness. Some system requests can only be activated once a pre-condition has been satisfied, e.g. the user can only score documents when the suffix tree has been created. The system should give informative warning messages if the user attempts to perform a task without its pre-conditions being satisfied. Where appropriate, the system may automatically carry out pre-conditions before performing the requested task. 7.2.2 Hardware and Software Constraints The application should be easily extensible and scalable. Developers should be able to add extra functionality and expand the workload the application can handle with relative ease. The design should allow for future enhancement of the system, which should be reasonably easy to maintain and upgrade. Code should also be well documented. 
The system should use an RDBMS to manage its data layer, but should not depend on the particular RDBMS used. 22 of 93
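The RDBMS-independence requirement above is usually met by hiding the concrete store behind an interface. As a minimal illustration (in Python rather than the project's C#, and with hypothetical names standing in for the report's Database Manager and OLEDB classes), callers depend only on the abstract contract, so the backend can be swapped for Access, SQL Server, or XML files without changes to the rest of the system:

```python
from abc import ABC, abstractmethod

class ScoreStore(ABC):
    """Abstract data layer: the rest of the system talks only to this
    interface, never to a concrete database driver."""

    @abstractmethod
    def insert_score(self, doc, config, score):
        """Persist the score of one document under one configuration."""

    @abstractmethod
    def get_score(self, doc, config):
        """Return a cached score, or None if it has not been computed."""

class InMemoryScoreStore(ScoreStore):
    """Trivial dictionary-backed store used here in place of the
    project's Access/OLEDB backend."""

    def __init__(self):
        self._rows = {}

    def insert_score(self, doc, config, score):
        self._rows[(doc, config)] = score

    def get_score(self, doc, config):
        return self._rows.get((doc, config))
```

Caching scores this way also serves the usability requirement in 7.2.1: an expensive scoring run is stored once and looked up thereafter rather than recomputed.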
  • 23. 7.2.3 Documentation Help menus and tool tips will be available to help users interact with the system. The application will also come with a user manual, including screen shots, and with written documentation for its installation and configuration. 7.3 System Framework It was decided to build the system from a number of components, each with a specialised function in the system. 7.3 illustrates the main components and the system boundary. The next section describes the function of each component in more detail, and section 7.5 contains the class diagram. By isolating system responsibilities, the following main components were identified. • User Interface • Display Manager • Classifier (Central Manager, STClassifier Manager, STClassifier) • Sampling Set Generator • Pre-processor • Cross-Validation • Results Manager (Database Manager, OLEDB, Database) 7.3 shows how the system is divided into a client/server architecture. The advantage of this set-up is its ease of maintenance, as the server implementation can be an abstraction to the client. All the functionality of the system is accessed through the graphical user interface (GUI). The implementation resides in the server, isolating users from system complexities not relevant to them. One of the main aims of the design of the system was to create a flexible framework. The green boxes seen in 7.3 represent new or alternative components that can be added to the system in the future with relative ease. 23 of 93
  • 24. Figure 6. System Components and Boundary (diagram: the Graphical User Interface, Display Manager, Central Manager, Sampling Set Generator, Pre-processor, Cross-Validation, STClassifier Manager, STClassifier, Results Manager, Database Manager, OLEDB, and Database, with supporting Utility, Random, and Stemmer components, and Input Data entering the system boundary) Figure 7. Client Server Division (diagram: the same components divided into a client, containing the Graphical User Interface, and a server, containing the remaining components) 24 of 93
  • 25. Figure 8. Additional or Alternative Components (diagram: the components of Figure 6 with 'Others...' placeholders marking the points where alternative input handlers, sampling methods, pre-processors, classifier managers, classifiers, and data stores can be plugged in) 7.4 Components in Detail 7.4.1 The Client - User Interface The user interacts with the system via a single graphical user interface, which is also the client. In this project the client is implemented as a set of Windows forms and controls in .NET. There is one main form from which users can access all the functionality of the system, plus a number of other dialog boxes and forms to help with navigation and interaction. For example, there is a Select Scoring Method form, used to request from the user the scoring methodology to use when scoring a new document. Other more generic forms, such as the Select Dialog form, are employed for a number of uses and do not display specific types of information (see section 10, Implementation Specifics, for further discussion). The client is simply an event handler for each of the GUI controls that calls the Central Manager via the Display Manager for actual data processing. The GUI contains no implementation, but delegates to the Display Manager, thus decoupling the interface from the implementation. There is two-way communication between the client and the Display Manager: a user invokes an event and related messages are passed to the Display Manager, which passes them on to the Central Manager; the Central Manager subsequently either delegates the task to other, more specialised controllers, or resolves the request itself. The design of the screens was done in consultation with potential users. 
The user should be able to perform all the tasks described by the use cases seen earlier in the Functional Requirements section (the functions will not be reiterated here). 25 of 93
  • 26. For this project, Windows forms were chosen for the implementation because most users are familiar with the Windows forms interface. This creates a familiar environment on initial interaction with the system and facilitates its use. In particular, the .NET framework provides a wealth of controls and functionality which help to build a user-friendly interface and hide the complexity of the underlying workings from the user. The different components are built as separate classes, so the user interface (the client) could be implemented using a methodology other than Windows forms, such as a command line, as illustrated. Figure 9. Client Interface and Its Collaborating Components (diagram: the Graphical User Interface with its Select Scoring Method and Select Dialog forms, an alternative Command Line interface, Input Data, and the Display Manager) 7.4.2 Display Manager The Display Manager is a layer between the User Interface on one side and the Central Manager and the rest of the system on the other; it essentially passes messages between these two components. The Display Manager is responsible for the information displayed back to the user, and it also manages the input data. 7.4.3 The Classifier It was mentioned in the previous section that the Central Manager is part of the classifier. 7.4.3 illustrates the classifier, which is enclosed by the red box, and its 26 of 93
  • 27. connecting components. The classifier comprises the Central Manager, a controller that manages the underlying model of the classifier, and the underlying model itself. The Central Manager is a controller that handles the communication between all the main components in the system that interact with the classifier. The Central Manager should provide the following functionality: • Select a sampling set for a corpus • Pre-process all documents in a corpus • Run cross-validation on a corpus • Create a classifier for a given corpus • Score all documents in a corpus • Classify all documents in a corpus • Obtain classification results for a corpus There are further controller classes called by the Central Manager to provide more specialised functionality: the Output Manager, Suffix Tree Manager, Sampling Set Generator, Pre-processor, and Cross-Validation. When a user loads a corpus into the system, it is managed by the Central Manager. If there is a request to create a sampling set, for example, the Central Manager knows where the corpus is located and delegates to the Sampling Set Generator the task of creating a sampling set based on parameters set by the user. Similarly, a request from the user to perform pre-processing on the corpus is delegated by the Central Manager to the Pre-processor. The various components are designed to have specialised tasks; they do not need to know where the data is located, as this information is passed to them when the Central Manager invokes a request. The Sampling Set Generator does not need to know how the Pre-processor carries out its task, nor does it need to know about the Cross-Validation component. These three components receive data and requests from the Central Manager, perform their tasks, and return any information back to the Central Manager. The classifier has to be connected to an internal model. 
In this project the suffix tree data structure is employed to model the representation of document characteristics. As seen in 7.4.3, the classifier can be implemented with different types of models, such as Naïve Bayes or neural networks. There is two-way communication between the Central Manager and the STClassifier via the STClassifier Manager. The STClassifier is a DLL library built by Birkbeck College researchers. It provides public interfaces to: • Build the representation of documents using the suffix tree data structure • Train the classifier • Score a document • Return classification results 27 of 93
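The manager-as-wrapper pattern described here, in which built classifiers are cached so that subsequent calls reuse the in-memory model, can be sketched as follows. This is an illustrative Python version with hypothetical names, not the actual C# STClassifier Manager:

```python
class ClassifierManager:
    """Wrapper that builds a classifier model once per corpus key and
    caches it, so later score/classify calls reuse the in-memory model
    instead of rebuilding the (expensive) tree."""

    def __init__(self, build_fn):
        self._build = build_fn    # expensive model construction, injected
        self._cache = {}
        self.build_count = 0      # how many times a model was really built

    def create(self, key, training_data):
        """Build the model for this key only if it is not cached yet."""
        if key not in self._cache:
            self._cache[key] = self._build(training_data)
            self.build_count += 1
        return self._cache[key]

    def contains(self, key):
        return key in self._cache

    def remove(self, key):
        """Drop the model; mirrors the user-triggered delete event."""
        self._cache.pop(key, None)
```

Injecting the build function keeps the manager independent of the concrete model, matching the report's aim of swapping in Naïve Bayes or neural network models behind the same manager interface.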
  • 28. The STClassifier Manager controls the flow of messages between the Central Manager and the STClassifier. Its responsibilities involve converting data to the format accepted by the STClassifier, and converting output from the STClassifier before it is passed back to the Central Manager. It is essentially a wrapper class for the STClassifier. The suffix tree is built using the contents of the documents in a training set. Once a suffix tree is built, it is cached in an ArrayList (a C# collection class implemented in .NET) managed by the STClassifier Manager. The suffix tree remains stored in memory until the user activates an event to delete it. As a result, the system does not need to recreate the suffix tree on every subsequent action that references it; only methods in the STClassifier Manager are called, and it is not necessary to call methods in the STClassifier directly. The classifier generates output data when a request is invoked to classify and score documents. These two actions can be time-consuming activities. The Central Manager decides what type of output data needs to be saved and passes the data from the classifier to the Results Manager to handle. Section 7.4.6 describes the design of the Results Manager. Figure 10. The Classifier and Its Collaborating Components (diagram: the Central Manager, reached from the Graphical User Interface or Command Line via the Display Manager, connected to the Results Manager, Sampling Set Generator, Pre-processor, and Cross-Validation, and to interchangeable classifier models - the STClassifier, NBClassifier, and NNClassifier - each accessed through its own manager) 7.4.4 Data Manipulation and Cleansing 28 of 93
  • 29. When a corpus is loaded into the system as input data, the user can create sampling sets from the initial corpus and also prepare the data for experimentation by performing various types of pre-processing on it. The input data is given to the classifier, which sends it to the Sampling Set Generator to handle the generation of sampling sets. Various sampling methodologies can be plugged into the Sampling Set Generator; for this project the system implements random sampling and systematic sampling. The Pre-processor provides the functionality for pre-processing data passed to it. Similarly, various methods of pre-processing can be plugged into the system with relative ease. Currently, the system provides stemming, stop word removal, and punctuation removal. In order for a method to plug into the system, a method class must implement an IMethod interface, which guarantees the following: • A method class must have a name property to return the name of the method. This is necessary so that new methods added to the system can be identified by name. • A method class must have a Run method. This method is where all the work is done. A set of utility classes provides helper functionality such as a random number generator, common divisor, and file system access. Figure 11. Data Manipulation and Cleansing Components and Their Collaborating Components (diagram: the Central Manager connected to the Sampling Set Generator, with pluggable Systematic, Random, and Snowball sampling methods and a Utility component, and to the Pre-processor, with pluggable Stemmer, Stop Word Removal, Punctuation Removal, and other methods) 7.4.5 Experimentation 29 of 93
  • 30. Setting up data for experimentation is the main responsibility of the Cross-Validation class. The Central Manager passes a corpus to the Cross-Validation component, which uses the data to build n-fold cross-validation sets. It divides the given corpus into n blocks and builds a training set and test set for each of the n runs. The data is stored as an array that is passed back to the Central Manager. The methods the Cross-Validation class is expected to perform are: • Set the number of folds • Run n-fold cross-validation on given source data • Return the cross-validation sets in an array data structure Figure 12. Cross-Validation and Its Collaborating Components (diagram: the Central Manager connected to the Cross-Validation component) 7.4.6 Results Manager The Results Manager handles the output of the classifier and the repository for that output. The underlying RDBMS of this project is an Access database, which is used to cache the data generated by the classifier. The OLEDB component is responsible for direct communication with the database. This class needs to provide the basic database functionality, such as read/write/delete, in a generic fashion. It is through the Database Manager object that all communication with the OLEDB library occurs, and through which data flows to and from the Results Manager; the Database Manager manages the OLEDB component. The green boxes illustrate that the information store for the system does not necessarily have to be an Access database: the system is designed so that the data can be stored by different means with relative ease, e.g. XML files, SQL Server, etc. 30 of 93
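The fold-building behaviour of the Cross-Validation class, including the stratification requirement noted in section 6.1.2, might look like the following. This is an illustrative Python sketch (the project is C#), with hypothetical names; a corpus is represented as a dict mapping class name to its documents:

```python
def stratified_folds(corpus, n_folds):
    """Split a class-grouped corpus into n folds, keeping the class
    proportions roughly equal in every fold (stratification)."""
    folds = [{cls: [] for cls in corpus} for _ in range(n_folds)]
    for cls, docs in corpus.items():
        for i, doc in enumerate(docs):
            folds[i % n_folds][cls].append(doc)   # deal out round-robin
    return folds

def cross_validation_runs(corpus, n_folds):
    """Yield (training_set, test_set) pairs, one per fold: each fold is
    the test set once while the remaining folds form the training set."""
    folds = stratified_folds(corpus, n_folds)
    for i in range(n_folds):
        test = folds[i]
        train = {cls: [] for cls in corpus}
        for j, fold in enumerate(folds):
            if j != i:
                for cls, docs in fold.items():
                    train[cls].extend(docs)
        yield train, test
```

Each of the n runs trains a classifier on its training set and evaluates it on the held-out test set, so every document is used for testing exactly once.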
  • 31. Figure 13. Results Manager and Its Collaborating Components (diagram: the Central Manager connected to the Results Manager, which contains the Database Manager and OLEDB components in front of the Access database, with an alternative XML File Manager and XML file storage shown as pluggable replacements) 7.4.7 Error Handling Adequate error handling is essential for an end-user application. The display of warnings and errors should be handled at the higher levels of the system, namely by the Display Manager, and then presented to the user in a reasonable fashion. Errors that occur in the other classes should be propagated to the Display Manager. All classes apart from the User Interface and the Display Manager are expected to implement an IErrorRecord interface. A class that implements this interface guarantees that it has a property called error which returns the error message. 31 of 93
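The two plug-in contracts described in this chapter, IMethod for pre-processing methods (section 7.4.4) and IErrorRecord for error propagation (section 7.4.7), can be sketched together. The report defines these as C# interfaces; the following is an illustrative Python analogue with a hypothetical concrete plug-in:

```python
from abc import ABC, abstractmethod

class IErrorRecord(ABC):
    """Contract from section 7.4.7: every component exposes its last
    error message so it can be propagated to the Display Manager."""

    @property
    @abstractmethod
    def error(self):
        """Return the last error message (empty if none)."""

class IMethod(IErrorRecord):
    """Plug-in contract from section 7.4.4: a name for identification
    plus a Run method where all the work is done."""

    @property
    @abstractmethod
    def name(self):
        """Return the display name of the method."""

    @abstractmethod
    def run(self, content):
        """Perform the pre-processing and return the transformed text."""

class StopWordRemoval(IMethod):
    """One concrete plug-in: drops words found in a stop-word list."""

    def __init__(self, stop_words):
        self._stop = set(w.lower() for w in stop_words)
        self._error = ""

    @property
    def error(self):
        return self._error

    @property
    def name(self):
        return "Stop Word Removal"

    def run(self, content):
        return " ".join(w for w in content.split()
                        if w.lower() not in self._stop)
```

Because every plug-in satisfies the same contract, the Pre-processor can discover methods by name and invoke them uniformly, which is what makes adding a new pre-processing method cheap.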
  • 32. 7.5 Class Diagram 7.5 shows a class diagram of the main components of the system discussed above (classes shown, with their members and associations: MainForm, Controllers::DisplayManager, Controllers::TreeViewNodeManager, Controllers::SampleSetGenerator, Controllers::Preprocessor, Controllers::CrossValidation, Classifier::CentralManager, Classifier::SuffixTreeManager, EMSTreeClassifier, DataMining::StopWord, Output::DatabaseManager, and Output::OLEDB).
Password : string , in DatabaseName : string, in Mode : string) +setDepth(in d : int) +AddItem(in destNode : TreeNode in newNodeName : string, in imageIdx : TreeImages) : TreeNode , + Close() +train(in classTrainingFiles : <unspecified > [][]) : bool -CreateNewNode nodeName : string, in imageIdx : TreeImages) : TreeNode (in Figure 14.Class Diagram 32 of 93
8 DATABASE
8.1 Entities All the data in the system is stored in an Access database. The following describes the organisation of the data that the system stores.
8.1.1 Score Table When a user requests the scoring of a new document or a set of documents, each document is scored against 126 configurations for each class. The results are cached in the score table.
8.1.2 Source Table The source table stores the location properties of documents: the physical pathname of each document and where it is logically located in the display tree.
8.1.3 Configuration Table The configuration table stores the 126 combinations of scoring methods used in Pampapathi et al.'s study. Each configuration consists of a scoring function, a match normalisation function, and a tree normalisation function.
8.1.4 Score Functions Table
This table contains the names and descriptions of the score functions.
8.1.5 Match Normalisation Functions Table This table contains the names and descriptions of the match normalisation functions.
8.1.6 Tree Normalisation Functions Table This table contains the names and descriptions of the tree normalisation functions.
8.1.7 Classification Condition Table This table stores any classification conditions to be considered when classifying a document from a particular corpus.
8.1.8 Class Weights Table This table stores the class weights used when classifying documents.
8.1.9 Temporary Max and Min Score Table
This is a temporary table used to cache the maximum and minimum scores for a class, grouped by document and configuration.
8.2 Views The following are the main views that assist in querying the main tables for the data displayed in the user interface.
8.2.1 Weighted Scores This view obtains the weighted scores by document and scoring configuration.
8.2.2 Maximum and Minimum Scores This view obtains the maximum and minimum scores by document and scoring configuration.
8.2.3 Misclassified Documents This view obtains the misclassified documents and related data.
8.3 Relation Design for the Main Tables The main table of the database is the Scores table. It contains the scores for each document, scored under the different configuration combinations (see the Implementation
section for a description of the scoring configurations). Figure 15 shows the relationships between the main tables: the Config table references the tScoreFunction, tMatchNormalisation, and tTreeNormalisation tables, while the Scores table and the tempMaxMinWScores table each reference both the Source table and the Config table.
Figure 15.Table Relations
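The 126 scoring configurations stored in the Config table are the cross product of the entries in the score function, match normalisation, and tree normalisation tables. As a minimal sketch (in Python rather than the project's C#; the function names below are hypothetical placeholders, not the actual ones from Pampapathi et al.'s study), such configuration rows can be enumerated like this:

```python
from itertools import product

def build_config_rows(score_fns, match_norms, tree_norms):
    """Enumerate every scoring configuration as (config_id, sf, mn, tn) rows,
    mirroring the Config table: one row per combination of score function,
    match normalisation, and tree normalisation."""
    rows = []
    for config_id, (sf, mn, tn) in enumerate(
            product(score_fns, match_norms, tree_norms), start=1):
        rows.append((config_id, sf, mn, tn))
    return rows

# Hypothetical function identifiers for illustration only.
rows = build_config_rows(["linear", "root"], ["none", "length"], ["none"])
# 2 * 2 * 1 = 4 configurations
```

With the real function lists the same cross product yields the 126 rows cached in the Config table.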
9 IMPLEMENTATION Due to the large size of the program, this report will not cover all the implementation details; instead the discussion focuses on the main classes and highlights some specific implementations. See Appendix B Class Definitions.
9.1 Main User Interface The main form of the user interface is divided into four resizable panes, each displaying a different type of information to the user (see Figure 16):
• tvExplorer
• rtxtView/sTreeView
• lblSTreeDetail/listView
• rtxtInfo
The tvExplorer is a Windows Forms TreeView control, which displays the different corpuses available in the system. The information is presented as a hierarchy of nodes, in the way files and folders are displayed in the left pane of Windows Explorer. The rtxtView is implemented as a Windows Forms RichTextBox control. When the user selects a child node in tvExplorer that represents a document, rtxtView displays the content of the document. The rtxtView also allows users to perform dynamic n-gram (sub-string) matching on a document (see section 10.3 Dynamic Sub-String Matching). The sTreeView is implemented as a TreeView control. It shares the same pane as the rtxtView control and is only made visible on the main form (with the rtxtView made invisible) when the user requests the display of a suffix tree that has been created. At the same time the lblSTreeDetail control, implemented as a Windows Forms Label control, displays a description of the suffix tree currently shown in the sTreeView control. The listView is a Windows Forms ListView control which provides information related to the current content of the rtxtView control. The rtxtInfo is a RichTextBox control and displays a classification summary for a document.
lblSTreeDetail/listView tvExplorer rtxtInfo rtxtView/sTreeView
Figure 16.Main User Interface
The main form is implemented as a .NET class called MainForm. Figure 17 shows the class members and class interface. Note that there are other Windows Forms control classes which were implemented to control the flow of user-system interaction. Section 10 Implementation Specifics describes one of them in detail; see Appendix x for all the user interface classes.
MainForm -tvExplorer : TreeView -mainMenu1 : MainMenu -mItemResources : MenuItem -mItemAddRCorpus : MenuItem -mitemSelectSampling : MenuItem -mitemPreprocess : MenuItem -mitemCrossValidation : MenuItem -cmenu : ContextMenu -AddClassificationSet : MenuItem -ClassifyAllDocs : MenuItem -ClassifyAllNewDocuments : MenuItem -ScoreAllNewDocs : MenuItem -fdrdialogCorpus : FolderBrowserDialog -components : IContainer -sTreeView : TreeView -pnlSTreeView : Panel -lblSTreeDetail : Label -openFileDialog1 : OpenFileDialog -pnltxtView : Panel -listView1 : ListView -rtxtView : RichTextBox -pnlExplorerTree : Panel -rtxtInfo : RichTextBox -centralMgr : CentralManager -toolTip1 : ToolTip -splitter1 : Splitter -splitter2 : Splitter -splitter3 : Splitter -splitter4 : Splitter -menuItem1 : MenuItem +MainForm() #Dispose(in disposing : bool) -InitializeComponent() -Main() -MainForm_Load(in sender : object, in e : EventArgs) -mItemAddRCorpus_Click(in sender : object, in e : EventArgs) -tvExplorer_AfterSelect(in sender : object, in e : TreeViewEventArgs) -mitemSelectSampling_Click(in sender : object, in e : EventArgs) -mitemPreprocess_Click(in sender : object, in e : EventArgs) -mitemCrossValidation_Click(in sender : object, in e : EventArgs) -cmenu_Popup(in sender : object, in e : EventArgs) -CreateSTreeMenuItems() -CreateSTree_Click(in sender : object, in e : EventArgs) -DeleteSTree_Click(in sender : object, in e : EventArgs) -DisplaySuffixTree_Click(in sender : object, in e : EventArgs) -GetDataSourceNode(in sTreeNode : TreeNode) : TreeNode -AddNewDoc_Click(in sender : object, in e : EventArgs) -AddClassificationSet_Click(in sender : object, in e : EventArgs) -rtxtView_MouseUp(in sender : object, in e : MouseEventArgs) -rtxtView_MouseEnter(in sender : object, in e : EventArgs) -ScoreAllDoc_Click(in sender : object, in e : EventArgs) -ClassifyAllDocs_Click(in sender : object, in e : EventArgs) -ClassifyAllNewDocuments_Click(in sender : object, in e : EventArgs) Figure
17.MainForm Class Definition
9.2 Display Manager The DisplayManager contains methods which collaborate with the CentralManager class to obtain information from the classifier. It also contains methods to display the relevant information in the user interface (i.e. the MainForm class). Figure 18 shows the class definition.
  • 40. Controllers::DisplayManager -nodeMgr : TreeViewNodeManager -classifier : CentralManager -dbProvider : string -dbUserId : string -dbPassword : string -dbName : string -dbAccessMode : string +AddNode(in destNode : TreeNode, in nodeNames : string[], in imageIdx : TreeImages, in selectedImageIdx : TreeImages) +FindNode(in selectedNode : TreeNode, in nodeName : string) : TreeNode +DisplayBlank () +DisplayFile(in filePathname : string) +SelectSampleCorpus(in defaultCorpus : string, in sourceNode : TreeNode in destNode : TreeNode) , +AddNewClassificationSet(in treeStructure : TreeView, in sourceNode : TreeNode, in destRoot : string) +PerformPreprocessing(in defaultCorpus : string, in sourceNode : TreeNode, in destNode : TreeNode) -PerformCrossValidation(in defaultCorpus : string, in sourceNode : TreeNode, in destNode : TreeNode) +SetupSTree(in defaultCorpus : string, in sourceFilesNode : TreeNode, in STreeNode : TreeNode) +DisplayScoresByDoc (in displayView : ListView, in sourceNode : TreeNode in filepath : string) , +ScoreAllDocuments(in sourceDataNode : TreeNode, in sTreeNodeName : string) +ClassifyAllDocuments (in sourceDataNode : TreeNode, in sTreeNodeName : string) +FlagMisClassifiedDocuments (in sourceNodePath : string, in sourceDataNode : TreeNode, in sf : int, in mn : int, in tn : int) +DeleteScores(in parentPath : string) +DeleteSTree(in STreeNode : TreeNode) +DisplaySTree(in displayTxt : Label, in diplayView : TreeView, in defaultCorpus : string, in dataSource : TreeNode, in STreeNode : TreeNode) +GetMatchInfo(in text : string, in STreeNode : TreeNode) : string +CleanupDatabase() Figure 18.DisplayManager Class Definition 9.3 Classifier Classes The classifier components implemented are: • Central Manager class • IClassifierModel interface • STClassifierManager class • STClassifier class At the lowest level of the classifier classes is the STClassifier class which performs generic suffix tree operations such as create suffix tree, train suffix tree, add 
classes, and score class. The STClassifierManager is a controller, or wrapper, for the STClassifier class; it contains methods to perform tasks that are more specific to the system. For a classifier model to be plugged into the Central Manager, it must implement the IClassifierModel interface. The figures below show the members and class interfaces of these classes.
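As an illustration of this plug-in contract, here is a minimal sketch in Python (the project itself is C#; the names are analogues of the IClassifierModel members, and DummyModel is a hypothetical stand-in, not the real SuffixTreeManager):

```python
from abc import ABC, abstractmethod

class IClassifierModel(ABC):
    """Python analogue of IClassifierModel: any model plugged into the
    CentralManager must provide these operations."""
    @abstractmethod
    def create(self, key, class_names, depth, class_files): ...
    @abstractmethod
    def contains(self, key): ...
    @abstractmethod
    def remove(self, key): ...
    @abstractmethod
    def get_class_names(self, key): ...
    @abstractmethod
    def get_class_scores(self, key, class_name, doc): ...

class DummyModel(IClassifierModel):
    """Minimal stand-in showing the plug-in mechanism only."""
    def __init__(self):
        self._models = {}
    def create(self, key, class_names, depth, class_files):
        self._models[key] = list(class_names)
        return True
    def contains(self, key):
        return key in self._models
    def remove(self, key):
        self._models.pop(key, None)
    def get_class_names(self, key):
        return self._models[key]
    def get_class_scores(self, key, class_name, doc):
        return [[0.0]]  # a real model would score doc against class_name
```

Any class satisfying this contract, such as the suffix tree model, can be swapped in without changing the Central Manager.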
  • 41. Classifier ::CentralManager -sampler : SampleSetGenerator -preprocessor : Preprocessor -crossValidator : CrossValidation -dataModelMgr : SuffixTreeManager -outputMgr : DatabaseManager -error : string +Create(in key : string, in classNames : string[], in depth : int, in classFiles : FileInfo[][]) : bool +Contains(in key : string) : bool +Remove(in key : string) +GetClassNames(in key : string) : string[] +GetClassScores(in key : string, in className : string, in doc : string) : double[,,] +ErrorMessage() : string +CentralManager() +GetModel(in key : string) : EMSTreeClassifier +GetFrequency(in key : string, in matchText : string, in classIdx : int) : int +Sampler() : SampleSetGenerator +Preprocessor() : Preprocessor +CrossValidator() : CrossValidation +OutputManager() : DatabaseManager Figure 19.CentralManager Class Definition «interface»DataMining::IClassifierModel +Create(in key : string, in classNames : string[], in depth : int, in classFiles : FileInfo[][]) : bool +Contains(in key : string) : bool +Remove(in key : string) +GetClassNames(in key : string) : string[] +GetClassScores(in key : string, in className : string, in doc : string) : double[,,] Figure 20.IClassifierModel Interface Definition Classifier ::SuffixTreeManager -createdSTreeList : SortedList -error : string +Create(in key : string, in classNames : string[], in depth : int , in classFiles : FileInfo[][]) : bool +Contains(in key : string) : bool +Remove(in key : string) +GetClassNames(in key : string) : string[] +GetClassScores(in key : string, in className : string, in doc : string) : double[,,] +ErrorMessage() : string +SuffixTreeManager() -AddSTreeToCache(in key : string, in sTree : EMSTreeClassifier) : bool +GetModel(in key : string) : EMSTreeClassifier +GetFrequency(in key : string, in matchText : string, in classIdx : int) : int Figure 21.SuffixTreeManager Class Definition EMSTreeClassifier -className : string[] -dictionary : string[] -dictionaryByClass : string[][] -mergedTree : 
EMSTreeClassifier.EMSTree +addToClass(in txt : string, in class : string) +classIntToName(in classInt : int) : string +classNameToInt(in className : string) : int +classScore(in example : string, in class : string, in nsf : int, in nmnf : int, in ntnf : int) : double[,,] +maxScore(in a : double[]) : static int +setDepth(in d : int) +train(in classTrainingFiles : <unspecified>[][]) : bool
Figure 22.EMSTreeClassifier Class Definition
9.4 Results Output Classes
The output components implemented are:
• IOutput interface
• DatabaseManager class
• OLEDB class
This project employed an Access database for its data storage component. At the lowest level of the results output classes is the OLEDB class. This class has direct access to the database and provides generic database operations: connecting to and disconnecting from the database, and executing SQL insert, delete, update, and select commands. The DatabaseManager class is a controller, or wrapper, for the OLEDB class, calling its methods to perform tasks more specific to the system. Notice that the IOutput interface has replaced the previously proposed Output Manager class. It was found that there was no real need for another class between the Database Manager and the rest of the system; instead, a contract ensures that the Database Manager provides a minimum set of functionality: opening and closing a data store, and the select, update, and delete operations. The figures below illustrate the definitions of the components.
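The contract idea can be sketched as follows (a hedged Python analogue, not the project's C# IOutput; InMemoryStore is a hypothetical stand-in for the Access-backed DatabaseManager):

```python
from abc import ABC, abstractmethod

class IOutput(ABC):
    """Python analogue of the IOutput contract: the minimum operations any
    data-store manager must provide."""
    @abstractmethod
    def open(self): ...
    @abstractmethod
    def close(self): ...
    @abstractmethod
    def is_open(self): ...
    @abstractmethod
    def insert(self, row): ...
    @abstractmethod
    def select(self, predicate): ...

class InMemoryStore(IOutput):
    """Hypothetical in-memory store satisfying the contract, used here in
    place of a real database-backed manager."""
    def __init__(self):
        self._open = False
        self._rows = []
    def open(self):
        self._open = True
    def close(self):
        self._open = False
    def is_open(self):
        return self._open
    def insert(self, row):
        self._rows.append(row)
    def select(self, predicate):
        return [r for r in self._rows if predicate(r)]
```

Because the rest of the system only depends on the contract, the Access-backed store could later be exchanged for any other store that implements it.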
«interface»DataMining::IOutput +Open() +Close() +InsertScores(in : double[,,], in : string, in : string, in : string, in : string, in : string) +Select(in : string) : string +Update(in : string) +Delete(in : string) +DeleteAll() +IsOpen() : bool Figure 23.IOutput Interface Definition Output::DatabaseManager -dbAccess : OLEDB -dbProvider : string -dbUserId : string -dbPassword : string -dbName : string -ScoresTable : string = "Scores" -ConfigTable : string = "Config" -ClassWeightsTable : string = "ClassWeights" -ClassifiedTable : string = "qry3a_MaxWScoreClass" -MisClassifyFiles : string = "qry2b_MisClassifiedByFile " -MatchByClass : string = "zqry2b_matchByClass _Crosstab" -error : string -bOpen : bool +ErrorMessage() : string +DatabaseManager() +SelectScoresByFile (in parentPathNode : string, in filePath : string) : OleDbDataReader +SelectMisClassifiedDocuments (in parentPathNode : string, in sf : int, in mn : int, in tn : int) : OleDbDataReader +SelectClassifiedClass (in sourceNodePath : string, in filepath : string, in sf : int, in mn : int, in tn : int) : OleDbDataReader +DeleteScores(in ParentNodePath : string) +Provider() : string +UserId() : string +Password() : string +DatabaseName() : string Figure 24.DatabaseManager Class Definition 42 of 93
Output::OLEDB -oleDbDataAdapter : OleDbDataAdapter -oleDbConnection : OleDbConnection -oleDbInsertCommand : OleDbCommand -oleDbDeleteCommand : OleDbCommand -oleDbUpdateCommand : OleDbCommand -oleDbSelectCommand : OleDbCommand +oleDbDataReader : OleDbDataReader -command : COMMAND -error : string -bOpen : bool +ErrorMessage() : string +IsOpen() : bool +InsertCommand() : string +DeleteCommand() : string +UpdateCommand() : string +SelectCommand() : string +GetReader() : OleDbDataReader +ExecuteCommand() : bool -SelectReader() : OleDbDataReader -UpdateReader() : OleDbDataReader -InsertReader() : OleDbDataReader -DeleteReader() : OleDbDataReader +OLEDB() +Open(in Provider : string, in UserID : string, in Password : string, in DatabaseName : string, in Mode : string) +Close()
Figure 25.OLEDB Class Definition
9.5 Other Controller Classes The SampleSetGenerator, Preprocessor, and CrossValidation classes have fairly simple class interfaces. The most important method of each class executes the main task the class is responsible for: creating a sampling set/corpus for the SampleSetGenerator class, performing pre-processing for the Preprocessor class, and running cross-validation for the CrossValidation class.
Controllers::SampleSetGenerator -error : string -methodNames : string[] = new string[] {"Census", "Random", "Systematic"} +ErrorMessage() : string -CodeToName(in code : int) : string +Run(in resourcePath : string, in destPath : string, in selectMethod : string) +MethodNames() : string[]
Figure 26.SampleSetGenerator Class Definition
Controllers::Preprocessor -stopWordFile : string -punctuationFile : string -methodNames : string[] = new string[methodCount] -error : string +ErrorMessage() : string +Preprocessor() -SetupMethodNames() -CodeToName(in code : int) : string +Run(in content : string, in type : string) : string +MethodNames() : string[]
Figure 27.Preprocessor Class Definition
Controllers::CrossValidation -folds : Array[] -noOfFolds : int -minFold : int = 2 -maxFold : int = 10 -error : string +ErrorMessage() : string +CrossValidation(in folds : int) +Run(in path : string) : Array[] +FoldCount() : int
Figure 28.CrossValidation Class Definition
The SampleSetGenerator class and the Preprocessor class have additional methodology classes plugged into them. As can be seen in each respective class, they have a class member called methodNames, an array that stores the name of each method implemented in the system. The Preprocessor class implements three pre-processing methodologies: punctuation removal, stop word removal, and stemming. Each method class has to implement the IMethod interface. Additional methodologies plugged into either class can each be built as a new class, and each has to implement the IMethod interface. The SampleSetGenerator class similarly implements three sampling methodologies: census sampling, random sampling, and systematic sampling. See appendix x for all class definitions. Below are the IMethod interface definition and the StopWord method class that is plugged into the Preprocessor class.
«interface» DataMining::IMethod +Name() : string +Run(in text : string) : string
Figure 29.IMethod Interface Definition
DataMining::StopWord -name : string -stringList : ArrayList = new ArrayList() -error : string +Name() : string +Run(in text : string) : string +ErrorMessage() : string +StopWord(in filePathName : string) +Add(in filePathName : string) -AddWord(in targetWord : string) +Clear() +Reset() +Contains(in word : string) : bool +StringList() : ArrayList
Figure 30.StopWord Class Definition
9.6 TreeView Controller Class During development it was discovered that it made sense to implement a separate controller class to manage the nodes displayed in the interface of the tvExplorer control
and the sTreeView control. The TreeViewNodeManager was implemented to handle TreeView node operations. The class includes methods to perform the following tasks:
• Create a new TreeNode in a TreeView control
• Add a TreeNode to a TreeView control
• Search for a TreeNode in a TreeView control
• Get a child TreeNode
Controllers::TreeViewNodeManager -error : string +ErrorMessage() : string +ChildNameExist(in TargetNode : TreeNode, in matchName : string) : bool +GetClassFiles(in classFileParent : TreeNode) : FileInfo[][] +GetChildrenNodeNames(in targetNode : TreeNode) : string[] +GetTreeNode(in targetNodeName : string, in Parentnode : TreeNode) : TreeNode +DisplaySTree(in displayView : TreeView, in sTree : EMSTreeClassifier, in classFreqToDisplay : string[]) +AddItemToTreeView(in root : TreeNode, in childNames : params string[]) : TreeNode +AddCrossValidationSetsToTreeView(in sourceNode : TreeNode, in content : Array[]) -PopulateRunNode(in content : Array[], in testSetNum : int, in parentNode : TreeNode) -Combine(in array1 : FileInfo[][], in array2 : FileInfo[][]) : FileInfo[][] +AddItem(in destNode : TreeNode, in newNodeName : string, in imageIdx : TreeImages) : TreeNode -CreateNewNode(in nodeName : string, in imageIdx : TreeImages) : TreeNode
Figure 31.TreeViewNodeManager Class Definition
9.7 Error Interface The IErrorRecord interface simply returns an error message. It is implemented by all the classes in the system apart from the MainForm class and the DisplayManager class.
«interface»DataMining::IErrorRecord +ErrorMessage() : string
Figure 32.IErrorRecord Interface Definition
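The IMethod plug-in mechanism from section 9.5 can be sketched as follows (a minimal Python analogue of the C# classes; the stop word list and method names here are illustrative only):

```python
class StopWord:
    """Sketch of a plug-in pre-processing method satisfying the IMethod
    contract: it exposes a name and a Run(text) transformation."""
    def __init__(self, stop_words):
        self._stop = set(w.lower() for w in stop_words)
    def name(self):
        return "Stop Word Removal"
    def run(self, text):
        # Drop every token that appears in the stop word list.
        return " ".join(w for w in text.split() if w.lower() not in self._stop)

class Preprocessor:
    """Dispatches to whichever registered method the caller names, as the
    Preprocessor class does with its plugged-in IMethod implementations."""
    def __init__(self, methods):
        self._methods = {m.name(): m for m in methods}
    def method_names(self):
        return list(self._methods)
    def run(self, content, type_):
        return self._methods[type_].run(content)

# Illustrative stop word list; the real system loads one from a file.
pre = Preprocessor([StopWord(["the", "a", "of"])])
cleaned = pre.run("the frequency of a match", "Stop Word Removal")
# cleaned == "frequency match"
```

Adding punctuation removal or stemming then means writing one more class with the same two methods and passing it to the Preprocessor, with no change to the dispatcher.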
10 IMPLEMENTATION SPECIFICS The system is very much an end-user application, and this section discusses a number of specific user interface implementations developed to satisfy some of the requirements.
10.1 Generic Selection Form Class When a user invokes the application to select a sampling set, the application (or more specifically, the SampleSetGenerator class) needs to know the following parameter settings in order to perform the task:
• Source corpus to select the sample from
• Type of sampling methodology to use
• Destination of the new sampling corpus created
It was decided to use a pop-up Windows Form, or dialog box, to collect this information from the user. Originally, a prototype for this dialog box was built that looked like the form shown in Figure 33. As illustrated, the top combo box lets the user select the corpus to use, and the destination to save the sampling sets is specified in the destination text box. The available sampling methods are each represented as a separate check box. This implementation made the form static: it could only be used for selecting sampling methodologies, and if a new sampling method was added to or removed from the system it would have been necessary to change the interface as well.
Select corpus Select/specify destination Choice of sampling methodologies
Figure 33.Pre-Processing Methods Dialog Box
A rethink of how to make the form more flexible and accommodate future changes led to an alternative design, illustrated in Figure 34. The check boxes were replaced with two list boxes. The list box on the left contains all the available pre-processing methods the
user can use, and the list box on the right contains the methods which the user has selected to run.
Figure 34.Generic Selection Form Class Used for Pre-processing
The form is implemented as a class called SelectDialog. The class was designed to be generic enough to be reusable for similar data requests, such as selecting a sampling method or selecting class frequencies to display along with a suffix tree (Figure 35). When the class is instantiated, the constructor lets the developer customise properties such as the form name and label names, and populate the left-hand list box.
Figure 35.Other Examples of the Generic Selection Form Class
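The model behind such a two-list dialog can be sketched as follows (a hypothetical Python sketch of the idea behind SelectDialog, not the actual C# class, and without any of the Windows Forms rendering):

```python
class SelectionModel:
    """State behind a generic two-list selection dialog: items move between
    an 'available' list and a 'selected' list, so the same form can serve
    sampling methods, pre-processing methods, or class frequencies."""
    def __init__(self, available):
        self.available = list(available)
        self.selected = []
    def select(self, item):
        if item in self.available:
            self.available.remove(item)
            self.selected.append(item)
    def unselect(self, item):
        if item in self.selected:
            self.selected.remove(item)
            self.available.append(item)

dlg = SelectionModel(["Census", "Random", "Systematic"])
dlg.select("Random")
# dlg.selected == ["Random"]; dlg.available == ["Census", "Systematic"]
```

Because the model only holds strings, adding or removing a methodology changes the list passed to the constructor, not the form itself, which is the flexibility the redesign was after.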
10.2 Visualisation of the Suffix Tree One of the requirements of the system was to be able to visualise the suffix tree. An initial prototype experimented with a custom class library to draw the suffix tree as a tree-like structure, shown in Figure 36. A red node represented an expanded node, and a blue node a non-expanded node. Each node label, apart from the root node, displayed the character in the suffix tree node and the class frequency information. With this implementation it was necessary to keep track of the layout of the nodes to make sure everything fitted on the page. It can be seen that if the frequencies of each node are included in the display, the visual representation becomes convoluted. Suffix trees built from text documents are expected to be large, and visually representing a suffix tree for a whole training corpus with this technique would be a problem. Various ways to improve this method of visual representation were considered, but in the end a different approach was adopted.
Figure 36.Suffix Tree Visualisation
The final implementation choice was inspired by the Windows Explorer directory tree structure, using the TreeView control of the .NET Windows Forms library. As seen in the example in Figure 37, the suffix tree representation is much clearer: each node can accommodate the display of a number of class frequencies without hindering clarity. Additionally, this approach is consistent with the display structure used in the tvExplorer control.
tvExplorer rtxtView w/ sTreeView
Figure 37.Suffix Tree Visualisation Implementation
10.3 Dynamic Sub-String Matching Another requirement was to be able to perform n-gram matching on documents: that is, to select a sub-string S1, verify whether S1 exists in the related suffix tree, and retrieve its frequency of occurrence in the tree. In the application built for this project, users are able to perform sub-string matching on the content of documents that belong to a corpus with an associated suffix tree, such as a corpus belonging to a cross-validation set or one that is a classification set. The chosen implementation was aimed at maximising interactivity. Once the related suffix tree has been created, the user is able to view the content of the document in the rtxtView control (a RichTextBox control of Windows Forms that forms one of the four panes on the main form). By selecting or highlighting a sub-string S1 in rtxtView, the system automatically queries the associated suffix tree and displays on screen the frequencies of S1 found in each class (see Figure 38). This user interface design makes the functionality a dynamic interaction and improves the user experience.
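The query behind this interaction can be illustrated with a toy depth-limited suffix tree that records per-class frequencies (a minimal Python sketch of the idea, not the project's EMSTreeClassifier; class names and documents below are invented):

```python
from collections import defaultdict

class ClassSuffixTree:
    """Depth-limited character suffix tree recording, at each node, how often
    that node's substring occurs in each class."""
    def __init__(self, depth):
        self.depth = depth
        self.root = {}  # char -> {"children": {...}, "freq": {class: count}}

    def add_document(self, text, cls):
        # Insert every suffix, truncated to the tree depth.
        for i in range(len(text)):
            children = self.root
            for ch in text[i:i + self.depth]:
                node = children.setdefault(
                    ch, {"children": {}, "freq": defaultdict(int)})
                node["freq"][cls] += 1
                children = node["children"]

    def match(self, s):
        """Per-class frequencies of substring s (truncated to the tree
        depth), or {} if it does not occur."""
        children, node = self.root, None
        for ch in s[:self.depth]:
            if ch not in children:
                return {}
            node = children[ch]
            children = node["children"]
        return dict(node["freq"]) if node else {}

tree = ClassSuffixTree(depth=3)
tree.add_document("abab", "spam")
tree.add_document("abc", "ham")
```

Highlighting a sub-string in rtxtView corresponds to one `match` call, whose per-class counts are what the interface displays.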
rtxtView w/ sTreeView
Figure 38.N-Gram Matching Example
10.4 User Interaction Warnings Some system events can only be activated once a pre-condition has been satisfied. The system uses different methods to give informative warning messages if the user attempts to perform a task without its pre-conditions being fulfilled. For example, the user can only score documents when the associated suffix tree has been created. If the user attempts to score documents before the tree has been constructed, the system shows a message box to warn the user (see Figure 39). Not all warnings are displayed as a message box. Message boxes require a response from the user before the next action can be performed: that is, the user has to close the message box first. In some situations this user response is not necessary. One of these situations is when the user wants to perform an n-gram match on a document. This can only be done when the associated suffix tree has been created. As seen in the previous section, the user can perform dynamic n-gram matching by selecting text displayed in the rtxtView control. If the associated suffix tree has not been created, a ToolTip control notifies the user that the suffix tree needs to be created first (see Figure 40).
rtxtView w/ sTreeView
Figure 39.Message Box Warning Example
rtxtView w/ sTreeView
Figure 40.ToolTip Warning Example
Alternatively, the system could simply disable the menu control for an action that is not available at a given point in time, but then it would not be intuitive for the user to know what is required to activate the functionality. Displaying informative warnings in different ways facilitates continuous user-system interaction. Beyond warnings, other general informative messages are also shown to the user, depending on the user-system interaction. For instance, when it is possible to perform n-gram matching on a document that is currently selected and viewed in the rtxtView control, the system notifies the user of this functionality with a ToolTip control when the mouse cursor is moved over the rtxtView control (Figure 41). Other more subtle indications of system state are also used. For example, a red-coloured tree icon indicates that a suffix tree has not been created, and a green-coloured tree icon indicates that a suffix tree has been created.
rtxtView w/ sTreeView
Figure 41.ToolTip Informative Use Example
11 USER GUIDE
11.1 Getting Started At application start-up there are five base nodes displayed in the top left panel: Resource Sets, Sampling Sets, Pre-Processed Sets, Cross-Validation Sets, and Classification Sets. These five nodes represent the five types of corpuses that the system differentiates. As you interact with the system and perform various tasks, new nodes are added to these base nodes as child nodes. Actions are requested using the main menu and tree-node-sensitive pop-up menus in the top left panel.
Figure 1.Main User Form at Application Start Up
11.1.1 Input Data The data loaded into the system as a corpus must follow a standard structure. The documents have to be in text format, represented as text files. The documents have to be stored in a location accessible by the system, in one main folder directory which contains subfolders representing the classes. Each class folder should contain the documents which have been pre-labelled as belonging to the class the folder represents. See Figures 42 and 43 for an example of Ham and Spam email corpus data.
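The expected layout, one main folder with one subfolder per class of pre-labelled text files, can be read with a few lines of code. This is a hypothetical Python helper for illustration, not part of the system:

```python
from pathlib import Path

def load_corpus(root):
    """Read a corpus laid out as described above: the subfolders of `root`
    are the classes, each containing pre-labelled .txt documents.
    Returns {class_name: [document paths]}."""
    corpus = {}
    for class_dir in sorted(Path(root).iterdir()):
        if class_dir.is_dir():
            corpus[class_dir.name] = sorted(class_dir.glob("*.txt"))
    return corpus
```

For the email example, `load_corpus("mail_corpus")` would return a dictionary with "Ham" and "Spam" keys, each mapped to that class's document files.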
Figure 42.Folder Directory Structure Email Example
Figure 43.Content of Class Directory Example
11.2 Loading a Resource Corpus To start, you can load resource data into the system. You can load more than one set of data. To load an initial corpus into the system, follow the steps described below.
• Select Actions | 1. Add Resource Corpus on the main menu
Figure 44.
• Then select the directory where your data is located and click [OK]. Note that the input data has to be in the standard structure explained in section x.
Figure 45.
• Once you have selected the data, it will be displayed as two levels of child nodes under the Resource Sets node. The system uses the same names for the child nodes as the folder directory names used in the input data.
  • 56. Figure 46. • Subsequently, you can navigate to the document nodes by expanding the nodes until you reach the leaf nodes, which are also the document nodes. You can view the content of a document by selecting the document node. Figure 47. 56 of 93
  • 57. 11.3 Selecting a Sampling Set You can select a sampling set from the resource sets. The methods currently available are census sampling, random sampling, and systematic sampling. • Select Actions | 2. Select Sampling Set on the main menu Figure 48. • Select the resource set from the combo box that you want to select a sample from. Figure 49. 57 of 93
  • 58. • Define the output location where you would like to store the data for the sampling set(s). You can either directly input the directory name in the destination text box, or you can click the browse command button and select the directory from the Browse for folder dialog box. Figure 50. • You can choose to run three different sampling methodologies on the resource set. The left hand list box contains the available sampling methods. Use the arrow command button to select or un-select the methods you wish to run. Each method you choose to run will generate a separate sampling set at your chosen destination. Census sampling takes the whole resource set as your sampling set. Random sampling is the purest form of probability sampling. Each member of the population has an equal and known chance of being selected. When there are very large populations, it is often difficult or impossible to identify every member of the population, so the pool of available subjects becomes biased. You will need to select the sample size ratio you wish to select. The combo box will give you the ratios available for the resource corpus you have selected. 58 of 93
  • 59. Figure 51. Systematic sampling is also called an Nth name selection technique. After the required sample size has been calculated, every Nth record is selected from a list of population members. As long as the list does not contain any hidden order, this sampling method is as good as the random sampling method. Its only advantage over the random sampling technique is simplicity. Select the Nth number to use for your systematic sampling selection. The combo box will give you the numbers available for the resource corpus you have selected. 59 of 93
  • 60. Figure 52. All the sampling methods are stratified and take into consideration the classes within a corpus when performing sampling. • Once you have selected the data, it will be displayed as two levels of child nodes under the Sampling Sets node. Figure 53. 60 of 93
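The three sampling methods above can be sketched as follows. This is an illustrative Python rendering, not the tool's C# implementation, and the function names are hypothetical; note that, as the guide states, every method is stratified, i.e. applied within each class separately.

```python
import random

def census_sample(docs_by_class):
    # Census sampling: the whole resource set becomes the sampling set.
    return {c: list(d) for c, d in docs_by_class.items()}

def random_sample(docs_by_class, ratio, seed=0):
    # Stratified random sampling: an equal-chance draw within each class,
    # sized by the chosen sample size ratio.
    rng = random.Random(seed)
    return {c: rng.sample(d, int(len(d) * ratio))
            for c, d in docs_by_class.items()}

def systematic_sample(docs_by_class, n):
    # Stratified systematic ("Nth name") sampling: every Nth document
    # per class, assuming the list carries no hidden order.
    return {c: d[n - 1::n] for c, d in docs_by_class.items()}
```

Because the selection is per class, the class proportions of the resource corpus are preserved in every sampling set.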
  • 61. • Subsequently, you can navigate to the document nodes by expanding the nodes until you reach the leaf nodes, which are also the document nodes. You can view the content of a document by selecting the document node. Figure 54. 11.4 Performing Pre-processing Once you have selected your sampling set(s), you can perform pre-processing on your sampling set(s). There are currently three methods available: stop word removal, punctuation removal and stemming. • Select Actions | 3. Run Pre-processing on the main menu 61 of 93
  • 62. Figure 55. • Select the sampling set and the destination where you would like the pre-processed corpus to be saved. Then select the types of text pre-processing you wish to perform. Figure 56. 62 of 93
  • 63. • Once you have selected the data, it will be displayed as two levels of child nodes under the Pre-Processing Sets node. Figure 57. • Subsequently, you can navigate to the document nodes by expanding the nodes until you reach the leaf nodes, which are also the document nodes. You can view the content of a document by selecting the document node. 63 of 93
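The three pre-processing options can be sketched as below. This Python sketch is illustrative only: the real tool uses the Porter stemmer [10], whereas `stem` here is a deliberately crude suffix-stripping stand-in, and the stop word list is a tiny illustrative subset rather than the tool's actual list.

```python
import string

STOP_WORDS = {"the", "a", "an", "and", "of", "to", "is"}  # illustrative subset

def remove_punctuation(text):
    # Strip all punctuation characters from the document text.
    return text.translate(str.maketrans("", "", string.punctuation))

def remove_stop_words(text):
    # Drop high-frequency function words that carry little class signal.
    return " ".join(w for w in text.split() if w.lower() not in STOP_WORDS)

def stem(text):
    # Crude stand-in for the Porter stemmer [10]: strip a few common
    # suffixes only, and only when a reasonable word stem remains.
    def stem_word(w):
        for suffix in ("ing", "ed", "s"):
            if w.endswith(suffix) and len(w) > len(suffix) + 2:
                return w[:-len(suffix)]
        return w
    return " ".join(stem_word(w) for w in text.split())
```

Each selected method is applied document by document, producing the pre-processed corpus saved at the chosen destination.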
  • 64. Figure 58. 11.5 Running N-Fold Cross-Validation Once you have selected your sample and performed pre-processing, you can analyse the data in terms of using it as a training set for a classifier by running n-fold cross-validation. N-fold cross-validation analysis will split the corpus into N blocks of data. Each block will have training set data and test set data. The former is used to train the classifier. Once a classifier has been built, you can test the classifier with the test set data. 11.5.1 Set Up Cross-Validation Set • Select Actions | 4. Run N-Fold Cross-Validation on the main menu 64 of 93
  • 65. Figure 59. • Select the pre-processed corpus to use and the number of N-fold. Figure 60. 65 of 93
  • 66. • Each fold, also known as a run, contains a Training Set and a Test Set. There is also an empty node for loading in new documents, and a node representing the status of the corresponding suffix tree. Figure 61. • For both the Training Set and Test Set nodes you can navigate to the document nodes by expanding the nodes until you reach the leaf nodes, which are also the document nodes. You can view the content of a document by selecting the document node. 66 of 93
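The way n-fold cross-validation partitions a corpus into runs, each with a training set and a test set, can be sketched as follows. This is an illustrative Python sketch; the round-robin assignment of documents to blocks is an assumption about how the tool divides the corpus, and the split is stratified per class, in keeping with the sampling methods described earlier.

```python
def n_fold_splits(docs_by_class, n):
    """Yield (training_set, test_set) pairs for each of the n runs.

    Run i uses block i of every class as its Test Set and the remaining
    blocks as its Training Set, so every document is tested exactly once.
    """
    for i in range(n):
        train, test = {}, {}
        for cls, docs in docs_by_class.items():
            test[cls] = [d for j, d in enumerate(docs) if j % n == i]
            train[cls] = [d for j, d in enumerate(docs) if j % n != i]
        yield train, test
```

Each yielded pair corresponds to one run node in the tree: the Training Set builds the suffix tree classifier and the Test Set is scored against it.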
  • 67. Figure 62. 11.5.2 Perform experiments on the data 11.5.2.1 Create the Suffix Tree You can create a suffix tree using the Training Set of the cross-validation run as the training data. To do this, follow the steps below. • Select the suffix tree node and right click the mouse. Select Create Suffix Tree… menu item 67 of 93
  • 68. Figure 63. • Select the suffix tree depth. Once the tree has successfully been created, the suffix tree icon changes into a green colour. Figure 64. 68 of 93
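A depth-limited suffix tree of the kind created above records, for each substring up to the chosen depth, its frequency in each class's training documents. The sketch below is an illustrative Python simplification (the class name and methods are hypothetical) that uses a flat dictionary instead of an actual tree, but it answers the same per-class frequency queries.

```python
from collections import defaultdict

class DepthLimitedSuffixIndex:
    """Illustrative stand-in for the tool's depth-limited suffix tree:
    for every substring up to `depth` characters it counts how often
    that substring occurs in each class's training documents."""

    def __init__(self, depth):
        self.depth = depth
        self.counts = defaultdict(lambda: defaultdict(int))

    def add_document(self, text, cls):
        # Index every substring of the document up to the tree depth.
        for start in range(len(text)):
            for length in range(1, min(self.depth, len(text) - start) + 1):
                self.counts[text[start:start + length]][cls] += 1

    def frequencies(self, ngram):
        # Strings longer than the tree depth can never match (cf. 11.5.2.4).
        return dict(self.counts.get(ngram, {}))
```

A real suffix tree shares prefixes between entries and so is far more compact, which is the point of using one; the dictionary here trades that compactness for brevity.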
  • 69. 11.5.2.2 Display Suffix Tree The suffix tree can only be displayed if the tree has been created. A red coloured tree icon indicates that a suffix tree has not been created, and a green coloured tree icon means the tree has been created. If you attempt to display the suffix tree without it being created, a message box will notify you that the suffix tree needs to be created first. See Section 11.5.2.1 on how to create a suffix tree. The steps below describe how to display the suffix tree. • Select the suffix tree node and right click the mouse. Select Display Suffix Tree… menu item Figure 65. • Select the class frequencies you would like to be displayed with the suffix tree 69 of 93
  • 70. Figure 66. • The top right panel shows information about the suffix tree. The bottom right panel displays the visualisation of the suffix tree. Expand the nodes to see each level. 70 of 93
  • 71. Figure 67. 11.5.2.3 Delete Suffix Tree • Select the suffix tree node and right click the mouse. Select Delete Suffix Tree… menu item Figure 68. 11.5.2.4 N-Gram Matching When the suffix tree icon is green, i.e. when the suffix tree has been created, you can perform n-gram matching on the documents in the Test Sets. N-gram matching lets you select a sub-string, match it against the suffix tree, and query the frequency of occurrences of the sub-string for each class. • Select a document under the Test Set node. The content of the document will be displayed in the bottom right pane. 71 of 93
  • 72. Figure 69. • Select a sub-string within the text you want to match against the suffix tree. Note that the maximum length of the string that will exist in the suffix tree is the same as the depth of the tree you specified when you created the tree. For example, if you created a suffix tree with a depth of 5, there will be no occurrences existing in the suffix tree for a string that is 6 characters in length. Figure 70. 72 of 93
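The query that n-gram matching performs, counting occurrences of the selected sub-string in each class's training data, can be sketched as below. This is illustrative Python, not the tool's code; counting overlapping occurrences is an assumption about the tool's behaviour, though it matches how a suffix tree naturally counts.

```python
def ngram_class_frequencies(substring, training_docs_by_class):
    """Count (overlapping) occurrences of the selected sub-string in
    each class's training documents: the quantity n-gram matching
    reports per class."""
    freqs = {}
    for cls, docs in training_docs_by_class.items():
        freqs[cls] = sum(
            sum(1 for i in range(len(doc)) if doc.startswith(substring, i))
            for doc in docs)
    return freqs
```

A large imbalance between the class counts for a sub-string is exactly the signal the classifier's scoring exploits.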
  • 73. 11.5.2.5 Score Documents You can score documents against each class once the suffix tree has been built. The system will calculate 126 different configurations of scoring methodologies. All scores are normalised. • Select a Test Set node and right click the mouse. Select Score All Documents menu item Figure 71. • Once the documents have been scored, you can view the results for each document by simply selecting the document nodes. 73 of 93
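The 126 configurations are consistent with the 7 scoring functions, 3 match normalisations, and 6 tree normalisations enumerated in Appendix B (7 × 3 × 6 = 126). The guide also notes that all scores are normalised; a simple min-max normalisation, shown here as an illustrative Python sketch rather than the tool's actual formula, maps each document's per-class raw scores into [0, 1].

```python
def normalise(scores):
    """Min-max normalise a {class: raw_score} map into [0, 1].

    Illustrative only: the tool's actual normalisation is one of the
    configurations listed in Appendix B, not necessarily this one.
    """
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All classes scored equally; no class stands out.
        return {c: 0.0 for c in scores}
    return {c: (s - lo) / (hi - lo) for c, s in scores.items()}
```

Normalising puts the scores from different configurations on a comparable scale, which matters when a minimum score lead is applied in the classification step.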
  • 74. Figure 72. 11.5.2.6 Classify Documents You can classify documents under the Test Set node once they have been scored. The system will flag any misclassified documents. • Select a Test Set node and right click the mouse. Select Classify All Documents… menu item 74 of 93
  • 75. Figure 73. • Specify the minimum score lead value. A document is given a score for each class, and the minimum score lead value is the amount by which the highest class score must lead all the other class scores before the document is classified under that class. The scores for each class can be weighted. Specify the weights for each class. Then specify the scoring configuration you want to use. Figure 74. • When the documents are classified, any misclassified documents will be flagged with a red document icon. You can select the files and view the scores in more detail. You can also drill down and do n-gram matching against the documents to analyse the reason for any misclassified documents. 75 of 93
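The classification rule described above, weight each class score and require the top score to lead the runner-up by at least the minimum score lead, can be sketched as follows. This is illustrative Python; returning None when the lead is not met is an assumption, since the guide does not specify the below-threshold behaviour.

```python
def classify(scores, weights, min_lead):
    """Classify a document from its {class: score} map.

    Each score is multiplied by its class weight (default 1.0); the
    document is assigned to the top-scoring class only if its weighted
    score leads the runner-up by at least `min_lead`. What happens
    below the threshold is an assumption here (None = unclassified).
    """
    weighted = {c: scores[c] * weights.get(c, 1.0) for c in scores}
    ranked = sorted(weighted, key=weighted.get, reverse=True)
    top, runner_up = ranked[0], ranked[1]
    if weighted[top] - weighted[runner_up] >= min_lead:
        return top
    return None
```

Raising the spam weight, for instance, biases borderline documents towards spam, while a larger minimum score lead makes the classifier more conservative overall.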
  • 76. Figure 75. 11.5.2.7 Add New Document to Classify • Select a New Documents node and right click the mouse. Select Add… menu item. Figure 76. 76 of 93
  • 77. • Select the document you want to add. You can select multiple documents if you wish. Figure 77. • Specify the minimum score lead value. A document is given a score for each class, and the minimum score lead value is the amount by which the highest class score must lead all the other class scores before the document is classified under that class. The scores for each class can be weighted. Specify the weights for each class. Then specify the scoring configuration you want to use. The system will automatically score and classify the document(s). 77 of 93
  • 78. Figure 78. • Once the new document(s) has been added you can view the scores and its content by clicking on the document node. The bottom left pane will display information about the selected document and classification details. You can also perform n-gram matching. 78 of 93
  • 79. Figure 79. 11.6 Creating a Classifier You can create a new classifier using corpuses with the icon as training data. These are the corpuses under the Sampling Sets node and Pre-processed Sets node. • Select a corpus with the icon and right click the mouse. Select Add to Classification Set menu item. A new set of nodes is displayed under the Classification base node. The set contains the training data used to build the classifier, a node for loading new documents, and a node to represent the associated suffix tree. Figure 80. • Similar to the functionalities available to perform on a cross-validation set, you can create, display, and delete the suffix tree. See section 11.5.2 Perform experiments on the data for detail. 79 of 93
  • 80. Figure 81. • Like for cross-validation sets, you can add new documents, score and classify them. You can then view the content and scores and perform n-gram matching on the documents. See section 11.5 Running N-Fold Cross-Validation for details. Figure 82. 80 of 93
  • 81. 12 TESTING The main input data used to test the tool was sourced from data used in Pampapathi et al.'s study [1]. The data is emails grouped into two classes: ham (legitimate emails) and spam (unsolicited emails). Testing was carried out throughout the development of the tool. Every functionality implementation was followed by tests on the functionality to make sure it worked and that new developments did not impact previously implemented code. The initial design of the system included storing an internal representation of the data loaded into the system. There was a Corpus class, which could contain N instances of a class called Class. A Class object contained an array which stored the pathnames of files/documents contained in each class of the corpus. As a result of the testing it was found that keeping an internal representation of the data took up significant computational time. The user interface was designed to show this information already in the TreeView structure (12). Therefore, to save extra computational time the information for sets of data was retrieved from the TreeView control instead, and the classes were dropped from the design. This design decision had other implications which are discussed in the next section. Figure 83.TreeView Control at the Top Left Panel of the User Interface Actually using the system also helped to identify where system warnings and messages should be appropriately shown. Different icons were used in the GUI to represent different types of data sets. A menu item to close the application was also added (12). 81 of 93
  • 82. Figure 84.Application Close The system is not limited in classifying ham and spam email documents as used for testing in this project, but extends to data with more than two classes. 82 of 93
  • 83. 13 CONCLUSION 13.1 Evaluation One of the first and most important questions after a software development project is ‘Does the system fulfil the original requirements?’ The aim was to build a management and visualisation tool for text mining applications. Apart from providing the functionalities required for such a tool, the system had to employ a flexible framework that would allow additions or substitution of the underlying components with relative ease. The system built for this project has managed to fulfil the core requirements and functions well. It provides a software infrastructure for a data mining system based on machine learning that automatically manages and refines the knowledge discovery and data mining process. The system hides the complexity involved in document categorisation and provides a single platform design for users in the research community to test and tune a classifier for their research domain. It has been built to be a wrapper to provide the two-step process involved in classification, and a platform to carry out classifier validation and use. Each system component is built as a separate class and can easily be replaced, or additions made, to provide different functionalities. For some components to be added to the system, they have to satisfy a contract that is defined by an interface. For example, to add a new classifier model next to the existing suffix tree classifier, the new class that is plugged into the system has to implement the IClassifierModel interface. There have been changes in the design of the system during the course of the project. Firstly, a new class, the TreeViewNodeManager, was added to handle the TreeView controls used in the GUI. Secondly, a corpus class and category class were dropped. These two classes were intended to represent the data the system used. Instead, the data logic was kept within the TreeView control structure used in the GUI. 
This change resulted in increased system processing speed, and reduced duplication of data management. However, it also meant that the currently implemented GUI is tightly integrated with the system through the Windows Forms implementation, which does not satisfy the flexible framework that the system aims to provide. The design change decision was taken because another aim of the project was to provide a visualisation tool, so it was felt that a graphical interface was an appropriate choice and that increasing system responsiveness in a user-end type of application is important. The Windows Forms library in .NET provides a wealth of powerful existing capabilities to implement sophisticated GUIs. Other forms of visualisation of the classification model, such as visually representing the suffix tree, were experimented with by building a custom DLL. This approach took up time and was essentially reinventing a wheel that was already available in .NET. It is unlikely that future work on the system would implement another user interface such as a command line; future requirements would be more concentrated on ways to improve the current GUI. If indeed there is a need to change the user interface, only the user interface and the DisplayManager need to be changed significantly, and the TreeViewNodeManager class discarded. The requirements outlined in the Requirements section have been fulfilled, though there is scope for improving the application. The next section lists some suggested future work on the tool. 83 of 93
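The IClassifierModel contract mentioned above (its full C# signature appears in Appendix B) can be mirrored in Python with an abstract base class. The sketch below is illustrative: only a subset of the interface's methods is reproduced, and the `ConstantModel` stand-in is hypothetical, showing merely that honouring the contract is all the system requires of a replacement classifier.

```python
from abc import ABC, abstractmethod

class ClassifierModel(ABC):
    """Python mirror of a subset of the C# IClassifierModel contract
    from Appendix B: Create, Contains, and GetClassNames."""

    @abstractmethod
    def create(self, key, class_names, depth, class_files):
        """Build a model under `key` from per-class training files."""

    @abstractmethod
    def contains(self, key):
        """Report whether a model exists under `key`."""

    @abstractmethod
    def get_class_names(self, key):
        """Return the class names the model under `key` was built with."""

class ConstantModel(ClassifierModel):
    # Hypothetical do-little model: any class implementing the contract
    # could replace the suffix tree classifier in the framework.
    def __init__(self):
        self._models = {}

    def create(self, key, class_names, depth, class_files):
        self._models[key] = list(class_names)
        return True

    def contains(self, key):
        return key in self._models

    def get_class_names(self, key):
        return self._models[key]
```

As with the C# interface, code written against `ClassifierModel` never needs to know which concrete model is plugged in.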
  • 84. 13.2 Future Work A number of improvements and additional functionalities are suggested below. • Database connection settings such as the database name, location, user id etc are currently hard coded in the DisplayManager class, which sets the class members in the DatabaseManager class. This could be incorporated in the user interface to request the settings from the user. This would not only remove the hard coding, but if another type of output method is added to the system, the user could select which output store to use. • The performance of a classifier is usually examined by evaluating the accuracy of the classification. Other evaluation approaches may involve determining the space and time overhead used. Although these are usually secondary, they are useful, especially when new classification models are added. Future work could incorporate these techniques into the tool. • The system could employ a configuration file (e.g. an XML file) to save settings from the system's last executed session, so that the next time the application is opened it will automatically set up the interface from the last session. • Going a step further, the configuration file could be used to store the current settings/system state during runtime. If this is implemented using a standard format such as XML, the same standard could be used as the format of messages exchanged between classes. • The ability to remove documents from a corpus. A user may find that a document that has been pre-classified to belong to a particular class is actually not a good representation for that class and should be removed from the training corpus. • Following on from the previous item, it may be useful not only to remove documents from a corpus but to move them to another corpus. • With a generic design, the previous functionality could with ease be extended to allow users to add a document to a corpus. 
• Future work could develop the system to handle training data that has N levels of hierarchy. A group of data is often found to belong to another higher-level group; for example, the classification automobile has cars, vans etc. Cars can be further broken down into the type or make of car, and then the car model. • Implementing a distributed system, running on several servers at once. This framework would provide support for high-throughput data processing. At the onset of the project, research into distributed system approaches was done. It was found that the remoting technology was best suited for this project. However, it is complex to set up, and time constraints did not allow for this form of architecture to be implemented in this project. • The presentation of the graphical user interface could be improved. For example, the main form could be further divided into corpus types so that each screen shows grouped information and is clearer. 84 of 93
  • 85. • The user guide could be integrated into the user interface, with search and index options. • The classifier is based on machine learning, and the training data for such a classifier has to be pre-labelled in order to use it. The system could incorporate a whole new set of functionalities which use clustering to find undetected group structures. As a result, the input data loaded into the system would not necessarily have to be pre-classified. Some of these changes can be made without too much trouble and would constitute an extra few months of work in total. Other suggestions are more involved and require more time to implement. The project was a good challenge and through the course of the project I have learnt a lot about software development. 85 of 93
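The configuration-file suggestion from the future work list could look like the following illustrative Python sketch; the XML element names and the choice of which settings to persist are assumptions, not part of the system.

```python
import xml.etree.ElementTree as ET

def save_session(settings, path):
    # Persist the last session's settings (which settings matter is an
    # assumption here) as a small XML document.
    root = ET.Element("session")
    for key, value in settings.items():
        ET.SubElement(root, "setting", name=key).text = str(value)
    ET.ElementTree(root).write(path)

def load_session(path):
    # Restore the {name: value} settings map saved by save_session.
    return {el.get("name"): el.text
            for el in ET.parse(path).getroot().findall("setting")}
```

On application close the tool would call `save_session`, and on the next start `load_session` would let it rebuild the interface from the last executed session, as the future work item envisages.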
  • 86. 14 BIBLIOGRAPHY [1] Rajesh M. Pampapathi, Boris Mirkin, Mark Levene. A Suffix Tree Approach to Text Categorisation Applied to Spam Filtering. Available online: http://arxiv.org/abs/cs.AI/0503030, February 2005. [2] Donald P. Ryan. Ancient Languages and Scripts. Webpage (last accessed 10 August 2005): http://www.plu.edu/~ryandp/texts.html [3] Fabrizio Sebastiani. Text Categorization. In Alessandro Zanasi (ed.), Text Mining and its Applications, WIT Press, Southampton, UK, 2005, p109-129. [4] Fabrizio Sebastiani. A Tutorial on Automated Text Categorisation. Istituto di Elaborazione dell’Informazione, Consiglio Nazionale delle Ricerche, Via S. Maria, 46-56126 Pisa (Italy), 1999. [5] Joachims T. Text Categorization with Support Vector Machines: Learning with Many Relevant Features. In Nedellec C & Rouveirol C (eds.) Proceedings of ECML-98, 10th European Conference on Machine Learning. Lecture Notes in Computer Science series, no. 1398. Heidelberg: Springer Verlag, 1998. p137-142. [6] Gill Bejerano and Golan Yona. Variations on Probabilistic Suffix Trees: Statistical Modelling and Prediction of Protein Families. School of Computer Science and Engineering, Hebrew University, Jerusalem 91904, Israel and Department of Structural Biology, Fairchild Bldg. D-109, Stanford University, CA, 94305, USA. [7] Fabrizio Sebastiani. Machine Learning in Automated Text Categorization. ACM Computing Surveys, 34(1), pp. 1-47, 2002. [8] Yiming Yang. An Evaluation of Statistical Approaches to Text Categorization. Information Retrieval, vol. 1, No 1/2, Kluwer Academic Publishers, pp. 69-90, 1999. [9] C. Manning and H. Schutze. Foundations of Statistical Natural Language Processing. MIT Press, 1999. [10]M. F. Porter. An algorithm for suffix stripping. In Readings in Information Retrieval, pp 313-316. Morgan Kaufmann Publishers Inc, 1997. 86 of 93
  • 87. Note: the algorithm was originally described in Porter, M. F., 1980, An algorithm for suffix stripping, Program, 14(3): 130-137. It has since been reprinted in Sparck Jones, Karen, and Peter Willet, 1997, Readings in Information Retrieval, San Francisco: Morgan Kaufmann, ISBN 1-55860-454-4. [11]Dan Gusfield. Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge University Press, 1997. [12]Jiawei Han. Data Mining: Concepts and Techniques. Morgan Kaufmann, Academic Press, London, 2001. [13]Margaret H. Dunham. Data Mining: Introductory and Advanced Topics. Prentice Hall, London, 2003. [14]Craig Larman. Applying UML and Patterns: An Introduction to Object-oriented Analysis and Design and the Unified Process (2nd ed). Prentice Hall PTR, US, 2002. [15]Martin Fowler. UML Distilled: A Brief Guide to the Standard Object Modeling Language (3rd ed). Pearson Education Inc, Boston, 2004. [16]Dave Thomas. Agile Programming: Design to Accommodate Change. IEEE Software, www.computer.org/software, vol. 22, No 3, May/June 2005. [17]Peter Drayton, Ben Albahari and Ted Neward. C# in a Nutshell: A Desktop Quick Reference. O’Reilly & Associates, California, 2002. [18]Jesse Liberty. Programming C#. O’Reilly, 2003. [19]C 87 of 93
  • 88. 15 APPENDIX A DATABASE Last Score ID: SELECT Max(Scores.ScoreId) AS MaxOfScoreId FROM Scores; Last Source ID: SELECT Max(Source.SourceId) AS MaxOfSourceId FROM Source; Weighted Scores View Query: SELECT Scores.ScoreId, Scores.SourceId, Scores.ConfigId, Scores.[Score Class], Scores.[True Class], Scores.Score, [Score]*ClassWeights.Weight AS WScore FROM ( Source INNER JOIN ClassWeights ON Source.[Node Parent Path] = ClassWeights.[Node Path]) INNER JOIN Scores ON Source.SourceId = Scores.SourceId WHERE (((Scores.[Score Class])=[Class]) ); Maximum and Minimum Scores View Query: SELECT MaxWTable.SourceId, MaxWTable.ConfigId, MaxWTable.[True Class], MaxWTable.MaxOfWScore, MaxWTable.MinOfWScore FROM ( SELECT ws2.SourceId, ws2.ConfigId, ws2.[True Class], Max(ws2.WScore) AS MaxOfWScore, Min(ws2.WScore) AS MinOfWScore FROM WeightedScores AS ws2 GROUP BY ws2.SourceId, ws2.ConfigId, ws2.[True Class] ) AS MaxWTable WHERE ( ((MaxWTable.MaxOfWScore) Not In ( SELECT Count(ws3.WScore) AS CountOfWScore FROM WeightedScores AS ws3 88 of 93
  • 89. GROUP BY ws3.SourceId, ws3.ConfigId, ws3.[True Class], ws3.WScore HAVING ( Count(ws3.WScore) >1) )) ); Misclassified Documents: SELECT Source.[Node Path], Source.[File Path], Config.SF, Config.MN, Config.TN, t2.[True Class], t2.[Score Class], t2.WScore FROM ( (qry2a_MaxMinLeadWScoresByFile AS t1 INNER JOIN qry2a_MaxMinLeadWScoresByFile AS t2 ON (t1.SourceId = t2.SourceId) AND (t1.ConfigId = t2.ConfigId)) INNER JOIN Source ON t1.SourceId = Source.SourceId) INNER JOIN Config ON t1.ConfigId = Config.ConfigId WHERE (((t2.[True Class])<>[t2].[Score Class]) AND ((t1.[True Class])=[t1].[Score Class]) AND ((t1.NScore)<[t2].[LeadDiff])); 89 of 93
  • 90. 16 APPENDIX B CLASS DEFINITIONS Intefaces: «interface»DataMining::IClassifierModel «interface» DataMining::IMethod +Create(in key : string, in classNames : string[], in depth : int, in classFiles : FileInfo[][]) : bool +Contains(in key : string) : bool +Name() : string +Remove(in key : string) +Run(in text : string) : string +GetClassNames(in key : string) : string[] +GetClassScores(in key : string, in className : string, in doc : string) : double[,,] «interface»DataMining::IOutput +Open() +Close() +InsertScores(in : double[,,], in : string, in : string, in : string, in : string, in : string) +Select(in : string) : string +Update(in : string) +Delete(in : string) «interface»DataMining::IErrorRecord +DeleteAll() +IsOpen() : bool +ErrorMessage() : string Data Types: DataType::TreeNormalisation DataType::MatchNormalisation «enumeration»DataType::TreeImages -averageFreq : string = "average frequency" -length : string = "length" +ResourceRoot = 3 -averageL1Freq : string = "average L1 frequency" -none : string = "none" +SamplingRoot = 4 -density : string = "density" -permutation : string = "permutation" +PreprocessRoot = 5 -none : string = "none" -itemCount : int = 3 +CrossValidationRoot = 6 -size : string = "size" -all : string[] = new string[itemCount] +ClassificationRoot = 7 -totalFreq : string = "total frequency" -GetAllItems() : string[] +Corpus = 13 -itemCount : int = 6 +All() : string[] +CorpusSel = 14 -all : string[] = new string[itemCount] +Length() : string +Class = 0 -GetAllItems() : string[] +None() : string +ClassSel = 1 +All() : string[] +Permutation() : string +Document = 2 +AverageFreq() : string +STreeNotCreated= 9 +AverageL1Freq() : string +STreeCreated = 10 +Density() : string +TestDocument = 11 +None() : string +NewDocument = 12 +Size() : string +NewDocumentAdded = 21 +TotalFreq() : string +MisClassifyDocument = 20 +ClassificationSet = 15 +SamplingSet = 16 DataType::ScoringFunctions +PreprocessingSet = 16 -constant : string = "constant" 
+ClassificationData = 17 -cosine : string = "cosine" +TestSet = 18 DataType::RootNodes +TrainingSet = 19 -linear : string = "linear" -logit : string = "logit" -resource_corpus : string = "Resource Sets" -root : string = "root" -sampling_corpus : string = "Sampling Sets" -sigmoid : string = "sigmoid" -preprocess_corpus : string = "Pre-Processed Sets" -square : string = "square" -crossValidation _set : string = "Cross-Validation" -itemCount : int = 7 -classification : string = "Classification " -all : string[] = new string[itemCount] +ResourceName() : string -GetAllItems() : string[] +ResourceIdx() : int +All() : string[] +SamplingName() : string +Constant() : string +SamplingIdx() : int +Cosine() : string +PreprocessName() : string +Linear() : string +PreprocessIdx() : int +Logit() : string +CrossValidationName() : string +Root() : string +CrossValidationIdx() : int +Sigmoid() : string +ClassificationName() : string +Square() : string +ClassificationIdx () : int 90 of 93
  • 91. User Interfaces: UI::SelectScoringMethod UI::MainForm -btCancel : Button -tvExplorer : TreeView -fdrdialogDest : FolderBrowserDialog -mainMenu1 : MainMenu -groupBox4 : GroupBox -mItemResources : MenuItem -label1 : Label -mItemAddRCorpus : MenuItem -txtLeadVal : TextBox -fdrdialogCorpus : FolderBrowserDialog -components : Container = null -imageList1 : ImageList -groupBox1 : GroupBox -components : IContainer +lstScoringFunc : ListBox -mitemVisualise : MenuItem -groupBox2 : GroupBox -mitemCreateSTree : MenuItem +lstMatchNorm : ListBox -cmenu : ContextMenu -groupBox3 : GroupBox -splitter1 : Splitter +lstTreeNorm : ListBox -sTreeView : TreeView -btOK : Button -pnlSTreeView : Panel -datagridClassWeights : DataGrid -lblSTreeDetail : Label -openFileDialog1 : OpenFileDialog -splitter2 : Splitter -dataTable : DataTable -openFileDialog1 : OpenFileDialog -dataSet : DataSet -AddClassificationSet : MenuItem -sqlClassWeightList : string[] -pnltxtView : Panel -sqlScoreLead : string -toolTip1 : ToolTip -sourceParentPath : string -listView1 : ListView -sourcePath : string -splitter3 : Splitter -nCount : int -rtxtView : RichTextBox +SQLWeightList() : string[] -ClassifyAllDocs : MenuItem +SQLScoreLead() : string -RemoveFile : MenuItem -SourcePath() : string -ScoreAllNewDocs : MenuItem -SourceParentPath() : string -ClassifyAllNewDocuments : MenuItem -Count() : int -pnlExplorerTree : Panel +SelectScoringMethod(in reader : OleDbDataReader, in leadVal : string, in sourceParentNodePath : string, in sourceNodePath : string, in ScoringFunctions : string[], in matchNormalisations : string[], in treeNormalisations : string[]) -splitter4 : Splitter -PopulateClassWeightBox(in reader : OleDbDataReader) -rtxtInfo : RichTextBox #Dispose(in disposing : bool) -centralMgr : DisplayManager -InitializeComponent() -mitemSelectSampling : MenuItem -btOK_Click(in sender : object, in e : EventArgs) -mitemPreprocess : MenuItem -BuildUpdateScoreLeadSQL() -mitemCrossValidation : MenuItem 
-BuildUpdateClassWeightsSQL() -menuItem1 : MenuItem -mItemExit : MenuItem -msg : ToolTip +MainForm() UI::SelectChoiceDialog #Dispose(in disposing : bool) -fdrdialogDest : FolderBrowserDialog -InitializeComponent() +cbSystematic : ComboBox -Main() -components : Container = null -MainForm_Load(in sender : object, in e : EventArgs) -btCreateSampleCorpus : Button -mItemAddRCorpus_Click(in sender : object, in e : EventArgs) -lblSelType : Label -tvExplorer_AfterSelect(in sender : object, in e : TreeViewEventArgs) -btCancel : Button -mitemSelectSampling_Click (in sender : object, in e : EventArgs) +SelectChoiceDialog(in formName : string, in labelName : string, in list : ArrayList) -mitemPreprocess_Click(in sender : object, in e : EventArgs) #Dispose(in disposing : bool) -mitemCrossValidation_Click(in sender : object, in e : EventArgs) -InitializeComponent() -cmenu_Popup(in sender : object, in e : EventArgs) -btCancel_Click(in sender : object, in e : EventArgs) -CreateSTreeMenuItems () -CreateSTree_Click(in sender : object, in e : EventArgs) -DeleteSTree_Click(in sender : object, in e : EventArgs) -DisplaySuffixTree_Click(in sender : object, in e : EventArgs) UI::SelectionDialog UI::SelectTextDialog -GetDataSourceNode(in sTreeNode : TreeNode) : TreeNode -groupBox1 : GroupBox -fdrdialogDest : FolderBrowserDialog -AddNewDoc_Click(in sender : object, in e : EventArgs) -btEditPunctuation : Button -components : Container = null -AddClassificationSet _Click(in sender : object, in e : EventArgs) -btRun : Button -btCreateSampleCorpus : Button -rtxtView_MouseUp(in sender : object, in e : MouseEventArgs) -btCancel : Button -lblType : Label -rtxtView_MouseEnter(in sender : object, in e : EventArgs) +txtDest : TextBox +txtValue : TextBox -ScoreAllDoc_Click(in sender : object, in e : EventArgs) -btBrowseDest : Button -btCancel : Button -ClassifyAllDocs _Click(in sender : object, in e : EventArgs) -fdrdialogDest : FolderBrowserDialog -ClassifyAllNewDocuments_Click(in sender : object, 
in e : EventArgs) +SelectTextDialog(in formName : string, in labelName : string) -RemoveFile_Click(in sender : object, in e : EventArgs) +lblSelType : Label #Dispose (in disposing : bool) +cbSelTypeOptions : ComboBox -mItemExit_Click(in sender : object, in e : EventArgs) -InitializeComponent() -lstAvail : ListBox -btCancel_Click(in sender : object, in e : EventArgs) +lstSelected : ListBox -txtValue_TextChanged(in sender : object, in e : EventArgs) -btSelect : Button -btRemove : Button -label1 : Label -label2 : Label -grpDest : GroupBox -components : Container = null -centralMgr : DisplayManager UI::CrossValidationDialog -systematicStep : string -btRun : Button -randonRatio : string -btCancel : Button +SystematicStep() : string -fdrdialogDest : FolderBrowserDialog +RandonRatio() : string +lblSelType : Label +SelectionDialog(in d : DisplayManager, in formName : string, in baseType : string, in itemList : string[], in defaultItem : string, in destEnable : bool, in selectionsAvail : string[]) -components : Container = null -PopulateListBox(in selectionsAvail : string[]) +cbTypeOptions : ComboBox #Dispose(in disposing : bool) -label1 : Label -InitializeComponent() +cbFolds : ComboBox -btBrowseDest_Click(in sender : object, in e : EventArgs) -centralMgr : DisplayManager -btRun_Click(in sender : object, in e : EventArgs) +CrossValidationDialog(in d : DisplayManager, in formName : string, in baseType : string, in itemList : string[], in defaultItem : string) -btSelect_Click(in sender : object, in e : EventArgs) #Dispose(in disposing : bool) -btRemove_Click(in sender : object, in e : EventArgs) -InitializeComponent() Methods: DataMining::Stemmer DataMining::Punctuation -name : string -name : string -error : string -stringList : ArrayList = new ArrayList() -b : char[] -error : string -i : int +Name() : string -i_end : int +Run(in text : string) : string -j : int +ErrorMessage() : string -k : int +Punctuation(in filePathName : string) -INC : int = 50 +Add(in filePathName : 
string) +Name() : string -AddWord(in targetWord : string) +Run(in text : string) : string +Clear() +ErrorMessage() : string +Reset() +Stemmer() +Contains(in word : string) : bool +add(in ch : char) +StringList() : ArrayList +add(in w : char[], in wLen : int) +ToString() : string +getResultLength() : int +getResultBuffer() : char[] -cons(in i : int) : bool DataMining::StopWord -m() : int -vowelinstem() : bool -name : string -doublec(in j : int) : bool -stringList : ArrayList = new ArrayList() -cvc(in i : int) : bool -error : string -ends(in s : string) : bool +Name() : string -setto(in s : string) +Run(in text : string) : string -r(in s : string) +ErrorMessage() : string -step1() +StopWord(in filePathName : string) -step2() +Add(in filePathName : string) -step3() -AddWord(in targetWord : string) -step4() +Clear() -step5() +Reset() -step6() +Contains(in word : string) : bool +stem() +StringList() : ArrayList 91 of 93
  • 92. Utility: IO::FileSystem -dir : DirectoryInfo +GetDirName(in dirPath : string) : string +GetDirRoot(in dirPath : string) : string +CopyDirectory(in src : string, in dest : string, in maxDepth : int, in level : int) Utility::CommonDivisor +GetCommonDivisors(in numbers : int[]) : ArrayList -FindCommonDivisors(in numsToCheck : ArrayList, in nums : ArrayList) : ArrayList Utility::RandomNumberGenerator -r : Random = new Random((int)DateTime.Now.Ticks) +Next(in bound : int) : int 92 of 93
  • 93. 17 APPENDIX C SOURCE CODE 93 of 93