2. Task
• Predict sentence-level sentiment
• white blood cells destroying an infection: +
• an infection destroying white blood cells: -
• Predict the distribution of sentiment over five categories:
1. Sorry, Hugs: User offers condolences to author.
2. You Rock: Indicating approval, congratulations.
3. Teehee: User found the anecdote amusing.
4. I Understand: Show of empathy.
5. Wow, Just Wow: Expression of surprise, shock.
3. Approach
• Learn a representation for the entire sentence using autoencoders (unsupervised)
• A tree structure is learnt (nodes and edges)
• Learn a sentiment classifier using the representation learnt in the previous step (supervised)
• Together, these two steps form a semi-supervised learning approach.
4. Neural Word Representations
• Randomly initialize the word representations
• For a vector x representing a word, sample it from a zero-mean Gaussian: x ∈ ℝⁿ, x ∼ N(0, σ²) (sketched at the end of this slide)
• Works well when the task is supervised, because given training data we can later tune the weights in the representation.
• Pre-trained word representations
• Bengio et al. 2003
• Given a context c, remove a word x that co-occurs in c, and use the remaining features of c to predict the occurrence of x.
• Similar to the Ando et al. (2003) suggestion for transfer learning via Alternating Structure Optimization (ASO).
• Can take into account co-occurrence information.
• Rich syntactic and semantic representations for words!
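Returning to the random initialization above, a minimal numpy sketch; the embedding size n, vocabulary size V, and standard deviation sigma are illustrative assumptions, not values from these slides:

```python
import numpy as np

n, V = 100, 5000   # embedding size and vocabulary size (illustrative values)
sigma = 0.01       # standard deviation of the zero-mean Gaussian (assumed)

# One n-dimensional column per word, sampled from N(0, sigma^2)
L = np.random.randn(n, V) * sigma
```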
5. Representing the input
• L: word-representation matrix
  • Each word in the vocabulary is represented by an n-dimensional vector
  • Those vectors are stacked as columns to form the matrix L
• Given a word k, it is represented as a one-hot binary vector b_k (zero everywhere except at index k). The vector representing word k is then x = L b_k (see the sketch below).
• This continuous representation works better than the original binary representation because the internal sigmoid activations are continuous.
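A minimal sketch of this lookup, showing that x = L b_k reduces to selecting the k-th column of L; the sizes and the word index are illustrative assumptions:

```python
import numpy as np

n, V = 100, 5000
L = np.random.randn(n, V) * 0.01   # word-representation matrix, one column per word

k = 42                             # vocabulary index of some word (illustrative)
b_k = np.zeros(V)
b_k[k] = 1.0                       # one-hot binary indicator vector
x = L @ b_k                        # x = L b_k, i.e. simply the k-th column L[:, k]
```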
6. Autoencoders (Tree Given)
• Binary trees are assumed (each parent has two children).
• Given the child node representations, iteratively compute the representation of their parent node.
• Concatenate the vectors of the two child nodes and apply W and bias b, followed by the sigmoid function f: p = f(W[c1; c2] + b) (sketched below).
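A minimal numpy sketch of one such encoding step, assuming randomly initialized weights; the decoder (W2, b2), which reconstructs the children so the reconstruction error can be computed, is made explicit here as an assumption:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n = 100                                  # child/parent vector size (illustrative)
W1 = np.random.randn(n, 2 * n) * 0.01    # encoder weights W
b1 = np.zeros(n)                         # encoder bias b
W2 = np.random.randn(2 * n, n) * 0.01    # decoder weights (for reconstruction)
b2 = np.zeros(2 * n)

def encode(c1, c2):
    """Parent representation p = f(W [c1; c2] + b)."""
    return sigmoid(W1 @ np.concatenate([c1, c2]) + b1)

def recon_error(c1, c2):
    """Squared reconstruction error of the two children from their parent."""
    p = encode(c1, c2)
    c_hat = W2 @ p + b2                  # reconstruction [c1'; c2'] = W2 p + b2
    return np.sum((np.concatenate([c1, c2]) - c_hat) ** 2)
```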
7. Structure Prediction
• Build the tree as well as node representations
• Concept
  • Given a sentence (a sequence of words), generate all possible binary trees over those words.
  • For each tree, learn autoencoders at each non-leaf node and compute the reconstruction error.
  • The total reconstruction error of a tree is the sum of the reconstruction errors at each node in the tree.
  • Select the tree with the minimum total reconstruction error.
Too expensive (slow) in practice: the number of binary trees over m words is the Catalan number C(m−1), which grows exponentially in m!
8. Greedy Unsupervised RAE
• For each pair of consecutive nodes, compute their candidate parent (sketched below); given n child nodes at some step, there are n−1 adjacent pairs and hence n−1 candidate parents.
• Select the candidate parent with the minimum reconstruction error and replace its two children with it, reducing the number of nodes to process as we go up the tree.
• Repeat this process until we are left with a single parent: the tree root.
• Weighted reconstruction
  • The number of descendants under each child must be considered when weighting its share of the reconstruction error.
  • Parent vectors are L2-normalized (p/||p||) to keep the hidden layer (W) from simply shrinking the vectors and thereby trivially reducing the reconstruction error.
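A compact sketch of this greedy construction, reusing the `encode` and `recon_error` helpers sketched above; the descendant weighting of the error is omitted here for brevity:

```python
import numpy as np

def greedy_rae_tree(vectors, encode, recon_error):
    """Greedily build a binary tree over a sentence's word vectors.

    `encode(c1, c2)` returns a parent vector and `recon_error(c1, c2)`
    its reconstruction error (e.g. the helpers sketched earlier).
    """
    nodes = list(vectors)
    merges = []                           # positions of the chosen merges
    while len(nodes) > 1:
        # n nodes give n - 1 adjacent pairs, i.e. n - 1 candidate parents
        errors = [recon_error(nodes[i], nodes[i + 1])
                  for i in range(len(nodes) - 1)]
        i = int(np.argmin(errors))        # cheapest pair to merge
        p = encode(nodes[i], nodes[i + 1])
        p = p / np.linalg.norm(p)         # L2-normalize the parent vector
        nodes[i:i + 2] = [p]              # replace the pair with its parent
        merges.append(i)
    return nodes[0], merges               # root representation and merge order
```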
9. Semi-Supervised RAE
• Each parent node is assigned a 2n-dimensional feature vector
• We can learn a weight vector (2n-dimensional) on top of those parent vectors
• The logistic sigmoid function is used as the output
• The sum of the cross-entropy losses over all parent nodes in the tree is minimized (sketched below)
• Similar to logistic regression
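A minimal sketch of the per-node classifier, assuming a single binary label for simplicity (predicting the full five-way distribution would use a softmax over five outputs instead; w and b are hypothetical classifier weights):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def node_ce_loss(p, t, w, b=0.0):
    """Cross-entropy of the sigmoid classifier at one parent node.

    p: parent feature vector, t: target label in [0, 1],
    w, b: classifier weights learned on top of the RAE features.
    """
    d = sigmoid(w @ p + b)               # predicted probability
    return -(t * np.log(d) + (1 - t) * np.log(1 - d))

# Tree-level supervised loss: sum the node losses over all parent nodes, e.g.
#   sum(node_ce_loss(p, t, w) for p in parent_vectors)
```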
10. Equations
• Sigmoid output at a parent node p
• Cross-entropy loss
• Aggregate the loss over all (sentence x, label t) pairs in the corpus
• The total loss of a tree is the sum of the losses at all parent nodes
• The loss at a parent node comes from two sources: reconstruction error + cross-entropy loss (written out below)
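A LaTeX sketch of the equations these captions refer to, following the semi-supervised RAE formulation described in these slides; α (the reconstruction/supervision trade-off) and λ (the regularization weight) are hyperparameters assumed here:

```latex
% Sigmoid output at a parent node p (W^{label}: classifier weights)
d(p; \theta) = f\!\left(W^{\mathrm{label}} p\right)

% Cross-entropy loss at one parent node with target distribution t
E_{cE}(p, t; \theta) = -\sum_{k} t_k \log d_k(p; \theta)

% Reconstruction error at one parent node
E_{rec}([c_1; c_2]; \theta) = \left\lVert [c_1; c_2] - [c_1'; c_2'] \right\rVert^2

% Loss at a parent node: reconstruction error + cross-entropy loss
E_s = \alpha \, E_{rec}([c_1; c_2]_s; \theta) + (1 - \alpha) \, E_{cE}(p_s, t; \theta)

% Total loss of a tree: sum over all parent nodes s of the tree for sentence x
E(x, t; \theta) = \sum_{s} E_s

% Aggregate over all (sentence x, label t) pairs, with L2 regularization
J = \frac{1}{N} \sum_{(x, t)} E(x, t; \theta) + \frac{\lambda}{2} \lVert \theta \rVert^2
```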
11. Learning
• Compute the gradient w.r.t. the parameters (the W matrices and bias b) and use L-BFGS (sketched below)
• The objective function is not convex, so only a local optimum can be reached with this procedure.
• Works well in practice.
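A sketch of how such training could be driven with an off-the-shelf L-BFGS routine; the objective below is a stand-in, not the actual RAE loss:

```python
import numpy as np
from scipy.optimize import minimize

def objective_and_grad(theta):
    """Stand-in objective: returns J(theta) and its gradient.

    A real implementation would unpack theta into the RAE parameters
    (W, b, the classifier weights, ...), build the greedy trees, and
    backpropagate the combined reconstruction + cross-entropy loss.
    """
    loss = float(np.sum(theta ** 2))   # placeholder loss
    grad = 2.0 * theta                 # its exact gradient
    return loss, grad

theta0 = np.random.randn(10) * 0.01    # illustrative initial parameters
result = minimize(objective_and_grad, theta0, jac=True, method="L-BFGS-B")
print(result.x)                        # parameters at the local optimum found
```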
12. EP Dataset
• Experience Project (EP) dataset
• People write confessions and others tag them.
• Five categories: Sorry, Hugs; You Rock; Teehee; I Understand; Wow, Just Wow (listed on the task slide above)
13. Baselines
• Random (20%)
• Most frequent (38.1%)
  • The most frequent label is "I Understand".
• Binary bag of words (46.4%)
  • Represent each sentence as a binary bag of words; no pre-trained representations are used.
• Features (47.0%)
  • Use sentiment lexicons and the training data (supervised sentiment classification with SVMs).
• Word vectors (45.5%)
  • Ignore the tree structure learnt by the RAE; aggregate the pre-trained representations of the words in each sentence and train an SVM.
• Proposed method (50.1%)
  • Learn the tree structure with the RAE using unlabeled data (sentences only).
  • A softmax layer is trained on each parent node using labeled data.
15. Random vectors!!!
• Randomly initializing the word vectors does very well!
• Because supervision occurs later, it is OK to initialize randomly
• Randomly initialized RAEs have shown similarly good performance in other tasks
  • "On Random Weights and Unsupervised Feature Learning", ICML '11
16. Summary
• Learn a sentiment classifier using recursive autoencoders
• Structure is also learnt using unlabeled data
  • Greedy algorithm for tree construction
• Semi-supervision at the parent-node level
• The model is general enough for other sentence-level classification tasks
• Random word representations do very well!