Co-clustering based classification for out-of-domain documents
Authors: Wenyuan Dai, Gui-Rong Xue, Qiang Yang, Yong Yu
KDD'07, August 12-15, 2007, San Jose, California, USA.
Presenter: Rei-Zhe Liu
Outline
     Introduction
     Problem formulation
     Co-clustering based classification algorithm
     Experiments
       Data sets
       Methods
         Data preprocessing
         Evaluation metrics
         Experimental results

     Conclusions and future work


Introduction (1/3)
     In this paper, we focus on the problem of classifying
      documents across different domains.

     We have a labeled data set Di from one domain, called the in-domain data, and an unlabeled data set Do from a related but different domain, called the out-of-domain data.

     Di and Do are drawn from different distributions, since they
      are from different domains.


Introduction (2/3)
     We assume that the class labels in Di and the labels to be predicted in Do are drawn from the same class-label set C.

     Furthermore, we assume that even though the two domains differ in distribution, they are similar in the sense that similar words describe similar categories.




Introduction (3/3)
     This assumption is often true, since Di and Do are related
      text domains, although some words in one domain may be
      missing in the other domain.

     Such missing words are the reason the estimated probabilities in the two domains can be quite different.

     Under these circumstances, our objective is to accurately classify the out-of-domain documents in Do by making use of the in-domain data Di and their labels.

Problem formulation (1/5)
     Mutual information is a measure of the dependency between random variables.

     Let X and Y be random variables with a joint distribution p(X,Y) and marginal distributions p(X) and p(Y). The mutual information I(X;Y) is defined as
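     In standard notation,

     $$ I(X;Y) = \sum_{x \in X} \sum_{y \in Y} p(x,y) \log \frac{p(x,y)}{p(x)\,p(y)} $$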




Problem formulation (2/5)
     The use of mutual information can also be motivated via the KL divergence, defined for two probability mass functions p(x) and q(x):
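     $$ D(p \,\|\, q) = \sum_{x} p(x) \log \frac{p(x)}{q(x)} $$

     In particular, mutual information is the KL divergence between the joint distribution and the product of the marginals:

     $$ I(X;Y) = D\big( p(X,Y) \,\|\, p(X)\,p(Y) \big) $$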




Problem formulation (3/5)
     Co-clustering on out-of-domain data aims to simultaneously cluster the out-of-domain documents Do and words W into |C| document clusters and k word clusters.

     Let ^Do denote the out-of-domain document clustering and ^W denote the word clustering, where |^W| = k.
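     Formally (a standard way to write this; the slide's own notation is assumed), the co-clustering is a pair of mappings

     $$ C_D : D_o \to \hat{D}_o, \qquad C_W : W \to \hat{W}, $$

     where each of the |C| document clusters corresponds to one class label in C.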




Problem formulation (4/5)
     We aim to minimize the loss in mutual information between documents and words before and after clustering.
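     For the out-of-domain data, this loss is

     $$ I(D_o; W) - I(\hat{D}_o; \hat{W}). $$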



     We likewise aim to minimize the loss in mutual information between class labels C and words W before and after clustering, for the in-domain data.
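     For the in-domain data, the corresponding loss is

     $$ I(C; W) - I(C; \hat{W}). $$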




Problem formulation (5/5)
     Integrating the two functions, the loss function for co-clustering based classification can be obtained:
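     $$ \Theta = \big[ I(D_o; W) - I(\hat{D}_o; \hat{W}) \big] + \lambda \big[ I(C; W) - I(C; \hat{W}) \big] $$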

     where λ is a trade-off parameter that balances the influence of the document co-clustering term and the in-domain label term on the word clusters.




Co-clustering based classification algorithm (1/6)
      Lemma 1: For a fixed co-clustering (^Do, ^W), we can write the loss in mutual information as
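     A plausible reconstruction, following the information-theoretic co-clustering framework that CoCC builds on:

     $$ \Theta = D\big( p(D_o, W) \,\|\, q(D_o, W) \big) + \lambda \, D\big( p(C, W) \,\|\, q(C, W) \big) $$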



      Definition:
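     Here q denotes the approximating distribution induced by the co-clustering; in the information-theoretic co-clustering formulation (assumed here),

     $$ q(d, w) = p(\hat{d}, \hat{w}) \, p(d \mid \hat{d}) \, p(w \mid \hat{w}), \quad d \in \hat{d},\; w \in \hat{w}, $$

     $$ q(c, w) = p(c, \hat{w}) \, p(w \mid \hat{w}), \quad w \in \hat{w}. $$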




Co-clustering based classification algorithm (2/6)
      Lemma 2:
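     Along the lines of information-theoretic co-clustering (the exact slide form is assumed), the document-side loss decomposes over individual documents:

     $$ D\big( p(D_o, W) \,\|\, q(D_o, W) \big) = \sum_{\hat{d}} \sum_{d \in \hat{d}} p(d) \, D\big( p(W \mid d) \,\|\, q(W \mid \hat{d}) \big) $$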




      Note that
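     $$ q(w \mid \hat{d}) = p(w \mid \hat{w}) \, p(\hat{w} \mid \hat{d}), \quad \text{for } w \in \hat{w}. $$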




Co-clustering based classification algorithm (3/6)
      Lemma 3:
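     By symmetry, a plausible reconstruction of the word-side decomposition, now including the in-domain label term weighted by λ:

     $$ \sum_{\hat{w}} \sum_{w \in \hat{w}} \Big[ p(w) \, D\big( p(D_o \mid w) \,\|\, q(D_o \mid \hat{w}) \big) + \lambda \, p(w) \, D\big( p(C \mid w) \,\|\, q(C \mid \hat{w}) \big) \Big], $$

     where the two p(w) marginals are taken under the document-word and the label-word distributions, respectively.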




      Lemma 2 tells us that minimizing the per-document term corresponding to a single document d reduces the global objective function value given in Lemma 1.
      From Lemma 3, we can draw a similar conclusion for individual words.



Co-clustering based classification algorithm (4/6)
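     Below is a minimal Python sketch of the alternating updates implied by Lemmas 2 and 3. All names, the random initialization, and the default parameters are assumptions (the paper initializes document labels with a supervised classifier trained on Di); treat it as an illustration of the technique, not the authors' implementation.

    import numpy as np

    EPS = 1e-12

    def kl(p, q):
        """KL divergence D(p || q) between two discrete distributions."""
        p = np.clip(p, EPS, None)
        q = np.clip(q, EPS, None)
        return float(np.sum(p * np.log(p / q)))

    def cluster_joint(p_dw, d_cl, w_cl, n_c, k):
        """Aggregate p(d, w) into p(d^, w^) plus cluster marginals."""
        M = np.zeros((n_c, p_dw.shape[1]))
        np.add.at(M, d_cl, p_dw)                 # sum rows by document cluster
        P = np.zeros((k, n_c))
        np.add.at(P, w_cl, M.T)                  # sum columns by word cluster
        P = P.T                                  # P[d^, w^] = p(d^, w^)
        return P, np.maximum(P.sum(1), EPS), np.maximum(P.sum(0), EPS)

    def cocc(p_dw, p_cw, k, lam=0.25, num_iters=20, seed=0):
        """Alternating CoCC-style updates (illustrative sketch).
        p_dw: (|Do|, |W|) joint distribution over out-of-domain docs and words.
        p_cw: (|C|, |W|) joint distribution over labels and words (in-domain).
        k: number of word clusters. Returns (document labels, word clusters)."""
        rng = np.random.default_rng(seed)
        n_d, n_w = p_dw.shape
        n_c = p_cw.shape[0]
        d_cl = rng.integers(0, n_c, n_d)         # document cluster = label
        w_cl = rng.integers(0, k, n_w)
        p_d, p_w = p_dw.sum(1), p_dw.sum(0)

        for _ in range(num_iters):
            # Document step (Lemma 2): move each document d to the cluster
            # minimizing D(p(W|d) || q(W|d^)), with q(w|d^) = p(w|w^) p(w^|d^).
            P, p_dhat, p_what = cluster_joint(p_dw, d_cl, w_cl, n_c, k)
            q_w = (p_w / p_what[w_cl]) * (P / p_dhat[:, None])[:, w_cl]
            for d in range(n_d):
                p_w_d = p_dw[d] / max(p_d[d], EPS)
                d_cl[d] = min(range(n_c), key=lambda c: kl(p_w_d, q_w[c]))

            # Word step (Lemma 3): move each word w to the cluster minimizing
            # D(p(Do|w) || q(Do|w^)) + lam * D(p(C|w) || q(C|w^)).
            P, p_dhat, p_what = cluster_joint(p_dw, d_cl, w_cl, n_c, k)
            q_d = (p_d / p_dhat[d_cl]) * (P / p_what[None, :]).T[:, d_cl]
            Pc = np.zeros((k, n_c))
            np.add.at(Pc, w_cl, p_cw.T)                       # p(c, w^)^T
            q_c = (Pc / np.maximum(Pc.sum(1, keepdims=True), EPS)).T  # p(c|w^)
            for w in range(n_w):
                p_d_w = p_dw[:, w] / max(p_w[w], EPS)
                p_c_w = p_cw[:, w] / max(p_cw[:, w].sum(), EPS)
                w_cl[w] = min(range(k), key=lambda j:
                              kl(p_d_w, q_d[j]) + lam * kl(p_c_w, q_c[:, j]))
        return d_cl, w_cl

     Because there are exactly |C| document clusters, the returned d_cl can be read directly as predicted labels for the out-of-domain documents.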




Co-clustering based classification algorithm (5/6)




Co-clustering based classification algorithm (6/6)
      Regarding time complexity, suppose the total number of document-word co-occurrences in Do is N and the number of iterations is T. Then,
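     reasoning as in information-theoretic co-clustering, each iteration touches every co-occurrence once per candidate document or word cluster, so a plausible bound (the exact form is an assumption) is

     $$ O\big( T \cdot N \cdot (k + |C|) \big). $$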

      Considering space complexity, our algorithm needs to store all the document-word co-occurrences and their corresponding probabilities. Thus the space complexity is
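     linear in the number of co-occurrences:

     $$ O(N). $$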




Experiments
Data sets (1/6)
      20 Newsgroups, SRAA, Reuters-21578




Data sets (2/6)
      The 20 Newsgroups corpus is a text collection of approximately 20,000 newsgroup documents, partitioned nearly evenly across 20 different newsgroups.
        We generated 6 different data sets.
        For each data set, 2 top categories were chosen, one as positive and the other as negative.
        Different subcategories can be considered as different domains.




Data sets (3/6)




Data sets (4/6)




Data sets (5/6)
      Figure 2 shows the document-word co-occurrence
       distribution on the auto vs. aviation dataset.

      The documents are ordered first by domain (documents 1 to 8000 are from Di, and 8001 to 16000 are from Do) and second by category (positive or negative).




Data sets (6/6)
      The words are sorted by the ratio n+(w) / n-(w), where n+(w) and n-(w) denote the number of occurrences of word w in positive and negative documents, respectively.

      From Figure 2, it can be seen that the distributions of the in-domain and out-of-domain data are somewhat different, but also that substantial commonality exists between the two domains.




Data preprocessing
      First, we converted all letters in the text to lower case and stemmed the words using the Porter stemmer.

      In addition, stop words were removed.
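     A minimal sketch of this pipeline, assuming NLTK as the toolkit (the paper does not name its implementation):

    # Requires the NLTK stopword list: nltk.download('stopwords')
    from nltk.corpus import stopwords
    from nltk.stem import PorterStemmer

    stemmer = PorterStemmer()
    stop_words = set(stopwords.words('english'))

    def preprocess(text):
        """Lowercase, drop stop words, and Porter-stem the remaining tokens."""
        tokens = text.lower().split()
        return [stemmer.stem(t) for t in tokens if t not in stop_words]

    # e.g. preprocess("Classifying related documents")
    # -> ['classifi', 'relat', 'document']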




Evaluation metrics
      Let C be the function mapping a document d to its true class label c = C(d), and F be the function mapping a document d to the label c = F(d) predicted by the classifier.
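     Given these definitions, the natural error-rate metric (assumed to match the slide's lost formula) is the fraction of out-of-domain documents whose predicted label disagrees with the true one:

     $$ \text{error rate} = \frac{ \big| \{\, d \in D_o : F(d) \neq C(d) \,\} \big| }{ |D_o| } $$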




Experimental results (1/5)




Experimental results (2/5)




Experimental results (3/5)




Experimental results (4/5)




Experimental results (5/5)




Conclusions and future work
      Our experiments show that CoCC greatly outperforms traditional supervised and semi-supervised classification algorithms when classifying out-of-domain documents.
      In CoCC, the number of word clusters must be quite large to obtain good performance. Since the time complexity of CoCC depends on the number of word clusters, this can make the algorithm inefficient.
      In the future, we will try to speed up the algorithm to make it more scalable for large data sets.

