A hybrid approach to discover the high-level category structure of a statistical website
Dihui Lu
School of Information and Library Science, University of North Carolina at Chapel Hill, lud@ils.unc.edu
Junliang Zhang
School of Information and Library Science, University of North Carolina at Chapel Hill, junliang@email.unc.edu
INTRODUCTION
One way to help people navigate and understand a large information collection is to provide a Yahoo-like category structure. In the Govstats project (http://ils.unc.edu/govstats), an unsupervised classification method was used to cluster web documents from the Bureau of Labor Statistics (BLS) website (http://www.bls.gov) into a top-level category structure. Although some promising results were generated, the heterogeneity of the existing HTML documents (including tables, publications, the standard occupational codebook, etc.) made it difficult for the unsupervised learning approach to work well. There are, however, some pre-labeled documents from "The Editor's Desk" on the BLS website (http://www.bls.gov/opub/ted/) that can be used as a training set for supervised classification, although the topics of this training set cannot fully represent those of the whole website. Hence, we employed a hybrid approach that combines supervised learning with unsupervised learning for category discovery. Both Naïve Bayes and Support Vector Machine (SVM) algorithms were used to build classifiers. Based on the average accuracy of the two algorithms, we applied the classifier generated by SVM to classify the remaining unlabeled web documents. Because the training set was not a representative sample of the site, some of the unlabeled documents had low probabilities of being classified into any of the known categories. As a complementary approach, K-means was then used to cluster these documents. Using this hybrid approach, we identified three additional categories. Further research will focus on evaluating this hybrid approach and comparing it with the current human-generated categories.

METHOD
A total of 17,068 documents had been crawled from the BLS website as of January 2004. The goal is to discover the category structure and classify these documents under it. A stop word list was used to remove trivial terms from all documents, and numbers and other non-textual symbols such as HTML tags were removed from the documents as well. After that, all words were stemmed using the Porter stemming algorithm (Porter, 1980). Among these documents, 1063 weekly review articles from the BLS website with subject headings were used for supervised classification. Forty-five unique subject-heading categories were identified, and the 13 labels with the most labeled documents were chosen for classification (the other subject headings were discarded because they were assigned to too few documents). The labels of these categories and the corresponding numbers of documents are displayed in Table 1. A document-term matrix was generated and used as input for Weka, a collection of machine learning algorithms for data mining tasks (http://www.cs.waikato.ac.nz/~ml/weka/) (Witten & Frank, 2000).

Table 1. 1063 training documents for 13 categories.

Subject Category       Number of Documents
Benefits               22
Compensation costs     85
Consumer               51
Earnings and wages     151
Employment             225
Industry studies       29
Labor force            22
Manufacturing          27
Occupational safety    58
Occupations            20
Prices                 206
Productivity           55
Unemployment           111
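The preprocessing pipeline described above (stop word removal, stripping of HTML tags and numbers, Porter stemming, and construction of the document-term matrix) was carried out with Weka in the original study. The sketch below is only a rough present-day approximation in Python using NLTK and scikit-learn; the file path and variable names are assumptions, not part of the original setup.

    import glob
    import re

    from nltk.stem import PorterStemmer
    from sklearn.feature_extraction.text import CountVectorizer

    stemmer = PorterStemmer()

    def preprocess(html_text):
        """Strip HTML tags and numbers, then Porter-stem the remaining words."""
        text = re.sub(r"<[^>]+>", " ", html_text)   # drop HTML tags
        text = re.sub(r"\d+", " ", text)            # drop numbers
        tokens = re.findall(r"[a-z]+", text.lower())
        return " ".join(stemmer.stem(t) for t in tokens)

    # Hypothetical location of the crawled BLS pages.
    docs = [preprocess(open(path, encoding="latin-1").read())
            for path in glob.glob("bls_crawl/*.html")]

    # Document-term matrix; the built-in English stop word list stands in
    # for the stop word list used in the study.
    vectorizer = CountVectorizer(stop_words="english")
    doc_term_matrix = vectorizer.fit_transform(docs)    # sparse, n_docs x n_terms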
In our preliminary study, both Naïve Bayes and Support Vector Machine were used to build classifier models on the training data set. Since the performance of a text classifier also depends on the selection of appropriate features for representing the documents (for example, using only titles versus using the full text), we studied the impact of the different algorithms and feature sets on classification performance. Ten-fold cross-validation was used to compare classification performance. Based on average accuracy and IR precision/recall (see Table 2), we chose SVM as our classification algorithm to build the model and classify the rest of the web documents, represented by full text (title and body text).

Table 2. Preliminary results from Naïve Bayes and SVM

                  Naïve Bayes             SVM
Features          Title    Title+body     Title    Title+body
Accuracy          76.7     76.2           81.8     83.5
IR precision      0.54     0.68           0.56     0.77
IR recall         0.66     0.81           0.46     0.73
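For illustration, the ten-fold cross-validation comparison of the two algorithms might look roughly as follows in the same Python setting (again, not the original Weka experiment; the labels variable is assumed to hold the 13 subject-heading categories of the 1063 labeled articles).

    from sklearn.model_selection import cross_val_score
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.svm import LinearSVC

    # doc_term_matrix and labels are assumed to hold the 1063 labeled weekly
    # review articles and their subject-heading categories.
    for name, clf in [("Naive Bayes", MultinomialNB()), ("SVM", LinearSVC())]:
        scores = cross_val_score(clf, doc_term_matrix, labels, cv=10,
                                 scoring="accuracy")
        print(name, "mean ten-fold accuracy:", round(scores.mean(), 3))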
Since the training set drawn from the site's weekly review articles was not a representative sample of the whole website, the classifier built from it would perform poorly on some of the web documents. The SVM classifier assigned each document a probability of belonging to each category. We identified 1988 documents that had low probabilities for all of the categories. The K-means clustering algorithm was then used to cluster these documents into four clusters. Statistics on the results are shown in Table 3. The concepts manually extracted for each cluster are displayed in Table 4.

Table 3. Number of documents in each cluster

Cluster number    Number of docs
0                 482
1                 832
2                 105
3                 569

Table 4. Concepts extracted from each cluster

Cluster number    Concepts
0                 Career, secure, retire
1                 Benefit, earn
2                 Consumption, geography
3                 Manufacture, farm, plant
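A minimal sketch of this two-stage hybrid step, assuming a probabilistic SVM and a 0.5 probability threshold (the actual cutoff used to identify the 1988 low-probability documents is not reported here), is shown below; train_matrix, train_labels, and unlabeled_matrix are hypothetical variables.

    from sklearn.cluster import KMeans
    from sklearn.svm import SVC

    # Train a probabilistic SVM on the labeled articles; probability=True
    # yields per-category probabilities, analogous to those used above.
    svm = SVC(kernel="linear", probability=True).fit(train_matrix, train_labels)

    # Probability of each unlabeled document belonging to each category.
    probs = svm.predict_proba(unlabeled_matrix)

    # Documents whose best category probability is low (threshold is
    # illustrative) are treated as fitting none of the 13 known categories.
    low_conf = probs.max(axis=1) < 0.5
    leftover = unlabeled_matrix[low_conf]

    # Cluster the leftover documents into four groups, as in the study.
    cluster_ids = KMeans(n_clusters=4, random_state=0).fit_predict(leftover)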
It seems that clusters 0, 2, and 3 generate some new concepts that are not included in the previous 13 categories. However, the concepts for each cluster were extracted manually, by reading randomly selected documents. We hope to obtain more comprehensive concepts by employing automatic concept extraction techniques.

Further study will include evaluating the automatically generated categories and comparing them with the current human-generated categories. In addition, since we have obtained some promising results from combining Latent Semantic Indexing (LSI) and SVM for classifying the web documents, we plan to apply LSI+SVM to improve classification accuracy. We also intend to try other, more sophisticated clustering algorithms, such as the EM (Expectation-Maximization) algorithm.
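A minimal sketch of the planned LSI+SVM combination, assuming truncated SVD as the LSI step and an illustrative number of latent dimensions (not a value chosen in the study):

    from sklearn.decomposition import TruncatedSVD
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC

    # LSI step (truncated SVD of the document-term matrix) followed by a
    # linear SVM; 300 latent dimensions is an illustrative choice only.
    lsi_svm = make_pipeline(TruncatedSVD(n_components=300), LinearSVC())
    lsi_svm.fit(train_matrix, train_labels)
    predictions = lsi_svm.predict(unlabeled_matrix)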
ACKNOWLEDGMENT
Thanks to Jonathan Elsas for writing part of the code. This work is supported by NSF grant EIA 0131824.

REFERENCES
Efron, M., Marchionini, G., & Zhang, J. (2003). Implications of the recursive representation problem for automatic concept identification in on-line governmental information. ASIST SIG-CR Workshop (Long Beach, CA, October 18, 2003).
Porter, M. F. (1980). An algorithm for suffix stripping. Program, 14(3), 130-137.
Witten, I. H., & Frank, E. (2000). Data mining: Practical machine learning tools with Java implementations. Morgan Kaufmann, San Francisco.