A hybrid approach to discovering the high-level category structure of a statistical website

Dihui Lu
School of Information and Library Science, University of North Carolina at Chapel Hill, lud@ils.unc.edu

Junliang Zhang
School of Information and Library Science, University of North Carolina at Chapel Hill,
junliang@email.unc.edu


INTRODUCTION
   One way to help people navigate and understand a large information collection is to provide a Yahoo-like category structure. In the Govstats project (http://ils.unc.edu/govstats), an unsupervised classification method was used to cluster the web documents from the Bureau of Labor Statistics (BLS) website (http://www.bls.gov) into a top-level category structure. Although some promising results were generated, the heterogeneity of the existing HTML documents (including tables, publications, the standard occupational codebook, etc.) made it difficult for the unsupervised learning approach to work well. There are, however, some pre-labeled documents from “The Editor’s Desk” on the BLS website (http://www.bls.gov/opub/ted/) that can be used as a training data set for supervised classification, although their topics cannot fully represent those of the whole website. Hence, we employed a hybrid approach that combines a supervised learning approach with an unsupervised learning technique for category discovery. Both Naïve Bayes and Support Vector Machine (SVM) learning algorithms were used to build classifiers. Based on the average accuracy of the algorithms, we applied the classifier generated by SVM to classify the rest of the unlabelled web documents. Because the training set was not representative of the site, some of the unlabelled documents had low probabilities of being classified into any of the known categories. As a complementary approach, K-means was then used to cluster these documents. Using this hybrid approach, we have identified three more categories. Further research will focus on evaluating this hybrid approach and comparing it with the current human-generated categories.
METHOD
   A total of 17,068 documents had been crawled from the BLS website as of January 2004. The goal is to discover the category structure and classify these documents under it. A stop word list was used to remove trivial terms from all documents, and numbers and other non-textual symbols such as HTML tags were removed as well. After that, all words were stemmed using the Porter stemming algorithm (Porter, 1980). Among these documents, 1,063 weekly review articles from the BLS website with subject headings were used for supervised classification. Forty-five unique subject-heading categories were identified, and the 13 labels with the most labeled documents were chosen for classification (the other subject headings were discarded because they were assigned to too few documents). The labels of these categories and the corresponding numbers of documents are displayed in Table 1. A document-term matrix was generated and used as input for Weka, a collection of machine learning algorithms for data mining tasks (http://www.cs.waikato.ac.nz/~ml/weka/) (Witten & Frank, 2000).

Table 1. 1063 training documents for 13 categories.

Subject Category        Number of Documents
Benefits                22
Compensation costs      85
Consumer                51
Earnings and wages      151
Employment              225
Industry studies        29
Labor force             22
Manufacturing           27
Occupational safety     58
Occupations             20
Prices                  206
Productivity            55
Unemployment            111

   In our preliminary study, both Naïve Bayes and Support Vector Machine were used to build classifier models on the training data set. Since the performance of a text classifier also depends on the features chosen to represent the documents (for example, titles only versus full text), we studied the impact of the different algorithms and features on classification performance. Ten-fold cross-validation was used to compare classification performance.
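The paper does not include the Weka code used for this comparison. As a rough sketch only, the fragment below shows how such a ten-fold cross-validation of Naive Bayes and SMO (Weka's SVM implementation) could be set up against a recent Weka 3 API; the ARFF file name (bls-ted.arff), the vocabulary size, and the single text attribute are assumptions for illustration, not details taken from the paper.

```java
import java.util.Random;

import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.functions.SMO;
import weka.classifiers.meta.FilteredClassifier;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class ClassifierComparison {

    // Wrap a base learner with a bag-of-words filter so the document-term
    // matrix is rebuilt inside each cross-validation fold.
    static Classifier textClassifier(Classifier base) {
        StringToWordVector bow = new StringToWordVector();
        bow.setLowerCaseTokens(true);
        bow.setWordsToKeep(2000);          // vocabulary size is an assumption
        FilteredClassifier fc = new FilteredClassifier();
        fc.setFilter(bow);
        fc.setClassifier(base);
        return fc;
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical ARFF: one string attribute with the document text
        // (title only, or title + body) and a nominal class attribute holding
        // the 13 subject-heading labels.
        Instances data = DataSource.read("bls-ted.arff");
        data.setClassIndex(data.numAttributes() - 1);

        String[] names = { "NaiveBayes", "SMO (linear SVM)" };
        Classifier[] learners = { textClassifier(new NaiveBayes()),
                                  textClassifier(new SMO()) };
        for (int i = 0; i < learners.length; i++) {
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(learners[i], data, 10, new Random(1)); // ten-fold CV
            System.out.println(names[i]);
            System.out.printf("Accuracy: %.1f%%%n", eval.pctCorrect());
            System.out.println(eval.toClassDetailsString());  // per-class precision/recall
        }
    }
}
```

Under this setup, comparing the title-only and title+body representations of Table 2 amounts to changing what text goes into the string attribute before the bag-of-words filter is applied.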
Based on average accuracy and IR precision/recall (see Table 2), we chose SVM as our classification algorithm to build the model and classify the rest of the web documents, represented by their full text (title and body text).

Table 2. Preliminary results from NaiveBayes and SVM

                    Naïve Bayes                SVM
Features            Title    Title+body        Title    Title+body
Accuracy            76.7     76.2              81.8     83.5
IR precision        0.54     0.68              0.56     0.77
IR recall           0.66     0.81              0.46     0.73
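Applying the chosen SVM model to the rest of the crawl requires vectorising the unlabelled pages with the same vocabulary as the training articles. The paper does not show this step; a minimal sketch using Weka's batch-filtering idiom might look like the following (the file names are hypothetical, and class values are simply left missing in the crawl file).

```java
import weka.classifiers.functions.SMO;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class ClassifyCrawl {
    public static void main(String[] args) throws Exception {
        // Hypothetical files: labeled TED articles and the unlabelled crawl,
        // both with one string attribute (full text) and the same class
        // attribute (its values are missing in the unlabelled set).
        Instances train = DataSource.read("bls-ted.arff");
        Instances crawl = DataSource.read("bls-crawl.arff");
        train.setClassIndex(train.numAttributes() - 1);
        crawl.setClassIndex(crawl.numAttributes() - 1);

        // Initialise the bag-of-words filter on the training data only, then
        // apply the same filter to the crawl so both share one vocabulary.
        StringToWordVector bow = new StringToWordVector();
        bow.setLowerCaseTokens(true);
        bow.setInputFormat(train);
        Instances trainVec = Filter.useFilter(train, bow);
        Instances crawlVec = Filter.useFilter(crawl, bow);

        SMO svm = new SMO();                  // Weka's SVM implementation
        svm.buildClassifier(trainVec);

        for (int i = 0; i < crawlVec.numInstances(); i++) {
            double label = svm.classifyInstance(crawlVec.instance(i));
            System.out.println(trainVec.classAttribute().value((int) label));
        }
    }
}
```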
Since the training set drawn from the weekly review articles was not a representative sample of the whole web site, the classifier built from it would perform poorly on some of the web documents. The SVM classifier assigned each document a probability of belonging to each category. We identified 1988 documents that had low probabilities for all of the categories. The K-means clustering algorithm was then used to group these documents into four clusters. Statistics on the results are shown in Table 3, and the concepts manually extracted for each cluster are displayed in Table 4.
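The probability cutoff and the exact clustering setup are not given in the paper. The sketch below is one possible reading of this step: it uses an assumed cutoff of 0.5 on the maximum class probability, hypothetical file names, and Weka's SMO with logistic models so that per-class probabilities are available, then clusters the low-confidence documents with SimpleKMeans. Printing the heaviest centroid terms per cluster is a crude stand-in for the automatic concept extraction discussed later, not the authors' manual procedure.

```java
import java.util.Arrays;

import weka.classifiers.functions.SMO;
import weka.clusterers.SimpleKMeans;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class LowConfidenceClustering {
    public static void main(String[] args) throws Exception {
        // Same data preparation as in the previous sketches (hypothetical files).
        Instances train = DataSource.read("bls-ted.arff");
        Instances crawl = DataSource.read("bls-crawl.arff");
        train.setClassIndex(train.numAttributes() - 1);
        crawl.setClassIndex(crawl.numAttributes() - 1);
        StringToWordVector bow = new StringToWordVector();
        bow.setLowerCaseTokens(true);
        bow.setInputFormat(train);
        Instances trainVec = Filter.useFilter(train, bow);
        Instances crawlVec = Filter.useFilter(crawl, bow);

        // SMO with logistic models fitted to its outputs, so that
        // distributionForInstance returns per-class probability estimates.
        SMO svm = new SMO();
        svm.setBuildLogisticModels(true);
        svm.buildClassifier(trainVec);

        // Keep documents whose best class probability stays below the cutoff.
        Instances lowConf = new Instances(crawlVec, 0);   // empty copy, same header
        for (int i = 0; i < crawlVec.numInstances(); i++) {
            double best = 0.0;
            for (double p : svm.distributionForInstance(crawlVec.instance(i))) {
                best = Math.max(best, p);
            }
            if (best < 0.5) {                             // cutoff is an assumption
                lowConf.add(crawlVec.instance(i));
            }
        }

        // SimpleKMeans cannot handle a class attribute, so drop the label column.
        int classPos = lowConf.classIndex();
        lowConf.setClassIndex(-1);
        Remove drop = new Remove();
        drop.setAttributeIndices(Integer.toString(classPos + 1)); // 1-based index
        drop.setInputFormat(lowConf);
        Instances clusterData = Filter.useFilter(lowConf, drop);

        SimpleKMeans km = new SimpleKMeans();
        km.setNumClusters(4);
        km.setSeed(1);
        km.buildClusterer(clusterData);

        // Crude "concept extraction": the terms with the largest centroid weights.
        Instances centroids = km.getClusterCentroids();
        for (int c = 0; c < centroids.numInstances(); c++) {
            double[] w = centroids.instance(c).toDoubleArray();
            Integer[] order = new Integer[w.length];
            for (int j = 0; j < w.length; j++) order[j] = j;
            Arrays.sort(order, (a, b) -> Double.compare(w[b], w[a]));
            System.out.print("cluster " + c + ":");
            for (int j = 0; j < 5 && j < order.length; j++) {
                System.out.print(" " + clusterData.attribute(order[j]).name());
            }
            System.out.println();
        }
    }
}
```

With 13 classes, even a fairly confident prediction can have a modest maximum probability, so in practice the cutoff would need to be tuned against the 1988-document figure reported above.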
Table 3. Number of documents in each cluster

Cluster number      Number of docs
0                   482
1                   832
2                   105
3                   569

Table 4. Concepts extracted from each cluster

Cluster number      Concepts
0                   Career, secure, retire
1                   Benefit, earn
2                   Consumption, geography
3                   Manufacture, farm, plant

   It seems that clusters 0, 2, and 3 generate some new concepts that are not covered by the previous 13 categories. However, the concepts for each cluster were extracted manually by reading randomly selected documents; we hope to obtain more comprehensive ones by employing automatic concept extraction techniques.
   Further study will include evaluating the automatically generated categories and comparing them with the current human-generated categories. In addition, since we have obtained some promising results from combining Latent Semantic Indexing (LSI) with SVM for classifying the web documents, we plan to apply LSI+SVM to improve the classification accuracy. We also intend to try other, more sophisticated clustering algorithms such as the EM (Expectation-Maximization) algorithm.

ACKNOWLEDGMENT
Thanks to Jonathan Elsas for writing part of the code. The work is supported by NSF grant EIA 0131824.

REFERENCES
Efron, M., Marchionini, G., & Zhang, J. (2003). Implications of the Recursive Representation Problem for Automatic Concept Identification in On-line Governmental Information. ASIST SIG-CR Workshop (Long Beach, CA, October 18, 2003).
Porter, M. F. (1980). An algorithm for suffix stripping. Program, 14(3), 130-137.
Witten, I. H., & Frank, E. (2000). Data Mining: Practical machine learning tools with Java implementations. Morgan Kaufmann, San Francisco.
