Semi-Supervised Autoencoders for
Predicting Sentiment Distributions
Socher, Pennington, Huang, Ng, Manning (Stanford)
         Presented by Danushka Bollegala (material for the 5th Deep Learning study group)
Task
•   Predict sentence level sentiment
    •   white blood cells destroying an infection: +
    •   an infection destroying white blood cells: -
•   Predict the distribution of sentiment
    1. Sorry, Hugs: User offers condolences to author.
    2. You Rock: Indicating approval, congratulations.
    3. Teehee: User found the anecdote amusing.
    4. I Understand: Show of empathy.
    5. Wow, Just Wow: Expression of surprise, shock.
Approach
•   Learn a representation for the entire sentence using autoencoders
    (unsupervised)
    •   A tree structure is learnt (nodes and edges)
•   Learn a sentiment classifier using the representation learnt in the
    previous step (supervised)
•   Together, the approach becomes a semi-supervised learning task.




Neural Word Representations
•   Randomly initialize the word representations
    •   For a vector x ∈ ℝⁿ representing a word, sample it from a zero-mean
        Gaussian: x ∼ N(0, σ²)
    •   Works well when the task is supervised because given training data we
        can later tune the weights in the representation.
•   Pre-trained word representations
    •   Bengio et al. 2003
    •   Given a context vector c, remove a word x that co-occurs in c, and use
        the remaining features in c to predict the occurrence of x.
    •   Similar to Ando et al.'s suggestion for transfer learning via
        Alternating Structure Optimization (ASO).
    •   Can take into account co-occurrence information.
    •   Rich syntactic and semantic representations for words!
Representing the input
•   L: word-representation matrix
    •   Each word in the vocabulary is represented by an n-dimensional vector
    •   Those vectors are then stacked in columns to create
        the matrix L
•   Given a word k, it is represented by the one-hot (1-of-V) indicator vector
    b_k. The vector representing word k is then x = L b_k, which is simply the
    k-th column of L.
•   This continuous representation is better suited to the model than the
    original binary representation because the internal sigmoid activations
    are continuous (a minimal lookup sketch follows below).
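The following is a minimal sketch of the lookup x = L b_k in NumPy; the toy vocabulary, the dimensionality n, and the initialization scale are illustrative assumptions, not values from the paper.

    import numpy as np

    # Illustrative vocabulary and embedding size (not taken from the paper).
    vocab = {"white": 0, "blood": 1, "cells": 2, "destroying": 3, "an": 4, "infection": 5}
    n = 4                      # embedding dimensionality (assumed for the sketch)
    V = len(vocab)

    # L stacks one n-dimensional column per vocabulary word.
    L = 0.05 * np.random.randn(n, V)

    def embed(word):
        """x = L @ b_k, where b_k is the one-hot indicator for `word`.
        In practice this is just a column lookup: L[:, vocab[word]]."""
        b = np.zeros(V)
        b[vocab[word]] = 1.0
        return L @ b

    x = embed("infection")     # continuous n-dimensional word vector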
Autoencoders (Tree Given)
•   Binary trees are assumed (each parent has exactly two children).
•   Given the child node representations, iteratively compute the
    representation of their parent node, bottom-up.
•   Concatenate the two child vectors, apply the weight matrix W and the bias
    b, and pass the result through the sigmoid nonlinearity f:
    p = f(W [c1; c2] + b)  (see the sketch below).
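A sketch of this composition step, assuming NumPy and a logistic sigmoid for f; the function and parameter names are illustrative.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def compose(c1, c2, W, b, f=sigmoid):
        """Parent representation p = f(W [c1; c2] + b) for two n-dim children.
        W has shape (n, 2n) and b has shape (n,)."""
        return f(W @ np.concatenate([c1, c2]) + b)

    # Shapes only; the weights here are random placeholders.
    n = 4
    W = 0.05 * np.random.randn(n, 2 * n)
    b = np.zeros(n)
    c1, c2 = np.random.randn(n), np.random.randn(n)
    p = compose(c1, c2, W, b)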
Structure Prediction
•   Build the tree as well as node representations
•   Concept
    •   Given a sentence (a sequence of words), generate all possible binary trees
        over those words.
    •   For each tree, apply the autoencoder at each non-leaf node and compute its
        reconstruction error.
    •   The total reconstruction error of a tree is the sum of the reconstruction
        errors at its non-leaf nodes.
    •   Select the tree with the minimum total reconstruction error.




        Too expensive (slow) in practice! The number of candidate binary trees
        grows exponentially with sentence length (see the count below).

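To make the cost concrete: the number of distinct binary trees over m ordered leaves is the Catalan number C(m-1). The small helper below computes it; this counting argument is a standard fact, not something taken from the slides.

    from math import comb

    def num_binary_trees(m):
        """Number of distinct binary trees over m ordered leaves: Catalan(m - 1)."""
        return comb(2 * (m - 1), m - 1) // m

    # A 20-word sentence already admits about 1.77 billion candidate trees.
    print(num_binary_trees(20))   # 1767263190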
Greedy Unsupervised RAE
•   At each step, compute a candidate parent for every pair of adjacent nodes.
    Given n nodes there are n-1 candidate parents.
•   Merge the pair whose parent has the minimum reconstruction error, replacing
    the two children by that parent. Each merge reduces the number of nodes by
    one, so repeating the process eventually leaves a single root.
•   Weighted reconstruction
    •   The number of descendants under each child is taken into account when
        computing the parent's reconstruction error.
•   Parent vectors are L2-normalized (p/||p||); otherwise the model could make
    the hidden-layer outputs (via W) arbitrarily small and thereby trivially
    reduce the reconstruction error. (A construction sketch follows below.)
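A sketch of this greedy construction under stated assumptions: a tanh nonlinearity for both encoder and decoder (the slide names a sigmoid; either works for the sketch), decoder parameters W_dec and b_dec that reconstruct [c1; c2] from p, and an unweighted squared reconstruction error (the paper's weighting by the number of descendants is noted but omitted). All names are illustrative.

    import numpy as np

    def reconstruction_error(c1, c2, p, W_dec, b_dec):
        """Squared error between the children and their reconstruction from p.
        (The paper additionally weights each child's term by the number of
        words it spans; that weighting is omitted here for brevity.)"""
        rec = np.tanh(W_dec @ p + b_dec)                 # reconstructed [c1'; c2']
        return float(np.sum((np.concatenate([c1, c2]) - rec) ** 2))

    def greedy_rae_tree(vectors, W, b, W_dec, b_dec):
        """Greedily merge adjacent nodes, always picking the pair whose parent
        has the smallest reconstruction error, until a single root remains."""
        nodes = list(vectors)
        merges = []                                      # (left, right) index of each merge
        while len(nodes) > 1:
            candidates = []
            for i in range(len(nodes) - 1):
                p = np.tanh(W @ np.concatenate([nodes[i], nodes[i + 1]]) + b)
                p = p / np.linalg.norm(p)                # L2-normalize the parent vector
                err = reconstruction_error(nodes[i], nodes[i + 1], p, W_dec, b_dec)
                candidates.append((err, i, p))
            err, i, p = min(candidates, key=lambda c: c[0])
            merges.append((i, i + 1))
            nodes[i:i + 2] = [p]                         # replace the two children by their parent
        return nodes[0], merges

    # Example usage with random parameters and a 5-word "sentence".
    n = 4
    W, b = 0.05 * np.random.randn(n, 2 * n), np.zeros(n)
    W_dec, b_dec = 0.05 * np.random.randn(2 * n, n), np.zeros(2 * n)
    root, merges = greedy_rae_tree([np.random.randn(n) for _ in range(5)], W, b, W_dec, b_dec)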
Semi-Supervised RAE
•   Each parent node is assigned a 2n-dimensional feature vector
•   A weight vector (2n-dimensional) is learnt on top of those parent vectors
•   A logistic sigmoid function is used as the output
•   The sum of the cross-entropy losses over all parent nodes in the tree is minimized
•   Similar to logistic regression (see the sketch below)




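A sketch of this supervised head, assuming a single binary label per sentence and NumPy; the 2n-dimensional parent features follow the slide, and the parameter names are illustrative.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def predict(parent_vec, w_label, b_label):
        """Logistic-sigmoid output on one parent feature vector (2n-dim)."""
        return sigmoid(w_label @ parent_vec + b_label)

    def cross_entropy(d, t, eps=1e-12):
        """Binary cross-entropy between prediction d and target t in {0, 1}."""
        return -(t * np.log(d + eps) + (1 - t) * np.log(1 - d + eps))

    def tree_label_loss(parent_vecs, t, w_label, b_label):
        """Sum of the cross-entropy losses over all parent nodes of one tree."""
        return sum(cross_entropy(predict(p, w_label, b_label), t) for p in parent_vecs)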
Equations
•   Sigmoid output at a parent node p
•   Cross-entropy loss at a parent node
•   Aggregate the loss over all (sentence x, label t) pairs in the corpus
•   Total loss of a tree is the sum of the losses at all its parent nodes
•   Loss at a parent node comes from two sources: reconstruction error + cross-entropy loss
    (The equations appear as images on the original slide; a hedged reconstruction follows below.)
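A hedged reconstruction of these equations, following the formulation in Socher et al. (2011); the trade-off weight α, the regularizer λ, and the softmax (which reduces to the logistic sigmoid for two labels) are taken from the paper rather than the slide, so treat this as a sketch.

    % Label prediction at a parent node p (softmax output; sigmoid when K = 2):
    d(p;\theta) = \operatorname{softmax}\bigl(W^{\mathrm{label}} p\bigr)

    % Cross-entropy loss at a parent node, for target distribution t:
    E_{cE}(p, t; \theta) = -\sum_{k=1}^{K} t_k \log d_k(p;\theta)

    % Loss at one non-leaf node s: reconstruction error plus cross-entropy,
    % traded off by alpha:
    E([c_1;c_2]_s, p_s, t; \theta)
        = \alpha\, E_{rec}([c_1;c_2]_s;\theta) + (1-\alpha)\, E_{cE}(p_s, t;\theta)

    % Total loss of a tree: sum over all non-leaf nodes s of the tree T(x):
    E(x, t; \theta) = \sum_{s \in T(x)} E([c_1;c_2]_s, p_s, t; \theta)

    % Corpus objective: average over (sentence, label) pairs, plus L2 regularization:
    J = \frac{1}{N} \sum_{(x,t)} E(x, t; \theta) + \frac{\lambda}{2}\,\lVert\theta\rVert^2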
Learning
•   Compute the gradient w.r.t. the parameters (the W matrices and biases b) and optimize with L-BFGS (see the sketch below)
•   The objective function is not convex.
•   Only a local optimum can be reached with this procedure.
•   Works well in practice.




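A sketch of this optimization step using scipy.optimize.minimize with L-BFGS; the objective below is a stand-in placeholder, not the RAE loss, and the parameter packing is assumed.

    import numpy as np
    from scipy.optimize import minimize

    def objective(theta_flat):
        """Placeholder for the semi-supervised RAE objective: unpack theta_flat
        into (W, b, W_dec, b_dec, W_label, ...), compute the regularized loss J
        and its gradient (backpropagation through the tree structure), and
        return both. A stand-in quadratic is used here so the sketch runs."""
        loss = float(np.sum(theta_flat ** 2))
        grad = 2.0 * theta_flat
        return loss, grad

    theta0 = 0.05 * np.random.randn(1000)      # random initialization of all parameters
    result = minimize(objective, theta0, jac=True, method="L-BFGS-B",
                      options={"maxiter": 200})
    theta_opt = result.x                        # a local optimum; the objective is non-convex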
EP Dataset
•   Experience Project (EP) dataset
    •   People write confessions and others tag them.
    •   Five categories (the labels from the Task slide: Sorry, Hugs; You Rock; Teehee; I Understand; Wow, Just Wow)




Baselines
•   Random (20%)
•   Most frequent (38.1%)
    •   I understand
•   Binary bag of words (46.4%)
    •   Represent each sentence as a binary bag-of-words vector; no pre-trained
        representations are used.
•   Features (47.0%)
    •   Use sentiment lexicons and the training data (supervised sentiment
        classification with SVMs).
•   Word vectors (45.5%)
    •   Ignore the tree structure learnt by the RAE; aggregate the pre-trained
        representations of the words in the sentence and train an SVM.
•   Proposed method (50.1%)
    •   Learn the tree structure with the RAE using unlabeled data (sentences only).
    •   A softmax layer is trained on each parent node using the labeled data.

Predicting the distribution
•   (Figure-only slide; the figure is not reproduced here.)
Random vectors!!!
•   Randomly initializing the word vectors does very well!
•   Because supervision occurs later, it is OK to initialize randomly
•   Randomly initialized RAEs have shown similarly good performance in
    other tasks
    •   "On Random Weights and Unsupervised Feature Learning" (Saxe et al., ICML 2011)




Summary
•   Learn a sentiment classifier using recursive
    autoencoders
•   The tree structure is also learnt using unlabeled data
    •   greedy algorithm for tree construction
•   Semi-supervision at the parent level
•   Model is general enough for other sentence level
    classification tasks
•   Random word representations do very well!
