Published on

http://www.facebook.com/DeepLearning

https://sites.google.com/site/deeplearning2013/


- 1. Semi-Supervised Autoencoders for Predicting Sentiment Distributions. Socher, Pennington, Huang, Ng, Manning (Stanford). Presented by Danushka Bollegala.
- 2. Task
  - Predict sentence-level sentiment:
    - "white blood cells destroying an infection": +
    - "an infection destroying white blood cells": -
  - Predict the distribution of sentiment over five categories:
    1. Sorry, Hugs: user offers condolences to the author.
    2. You Rock: indicating approval, congratulations.
    3. Teehee: user found the anecdote amusing.
    4. I Understand: show of empathy.
    5. Wow, Just Wow: expression of surprise or shock.
- 3. Approach
  - Learn a representation for the entire sentence using autoencoders (unsupervised).
    - A tree structure (nodes and edges) is learnt.
  - Learn a sentiment classifier using the representation learnt in the previous step (supervised).
  - Together, the approach becomes a semi-supervised learning task.
- 4. Neural Word Representations
  - Randomly initialized word representations:
    - For a vector x representing a word, sample it from a zero-mean Gaussian: x ∈ ℝⁿ, x ∼ N(0, σ²).
    - Works well when the task is supervised, because given training data we can later tune the weights in the representation.
  - Pre-trained word representations (Bengio et al., 2003):
    - Given a context vector c, remove a word x that co-occurs in c, and use the remaining features in c to predict the occurrence of x.
    - Similar to the suggestion of Ando and Zhang for transfer learning via Alternating Structure Optimization (ASO).
    - Can take co-occurrence information into account.
    - Rich syntactic and semantic representations for words!
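The random initialization described above can be sketched in a few lines; the function name and the choice of σ = 0.1 are illustrative assumptions, not from the paper.

```python
import numpy as np

# Hypothetical sketch: initialize one n-dimensional vector per vocabulary
# word, stacked as columns, each entry drawn from a zero-mean Gaussian.
def init_word_matrix(vocab_size, n_dim, sigma=0.1, seed=0):
    rng = np.random.default_rng(seed)
    return rng.normal(loc=0.0, scale=sigma, size=(n_dim, vocab_size))

L = init_word_matrix(vocab_size=5, n_dim=3)
```

Because the supervised stage later tunes these vectors, the particular σ matters less than it would in a purely unsupervised setting.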
- 5. Representing the Input
  - L: word-representation matrix.
    - Each word in the vocabulary is represented by an n-dimensional vector.
    - Those vectors are stacked as columns to create the matrix L.
  - A word k is represented as a 1-of-k binary vector b_k. The vector representing word k is then x = L b_k.
  - This continuous representation is better than the original binary representation because the internal sigmoid activations are continuous.
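The lookup x = L b_k is just column selection: multiplying L by a one-hot vector picks out column k. A minimal sketch with a toy matrix (the values are illustrative only):

```python
import numpy as np

n, V = 4, 6
L = np.arange(n * V, dtype=float).reshape(n, V)  # toy word-representation matrix

def word_vector(L, k):
    b_k = np.zeros(L.shape[1])
    b_k[k] = 1.0           # 1-of-k binary vector
    return L @ b_k         # equals L[:, k]

x = word_vector(L, 2)
```

In practice one would index `L[:, k]` directly; the matrix-vector form simply makes the notation on the slide concrete.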
- 6. Autoencoders (Tree Given)
  - Binary trees are assumed (each parent has two children).
  - Given the child node representations, iteratively compute the representation of their parent node.
  - Concatenate the two child vectors, apply W and bias b, followed by a sigmoid function f.
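The parent computation on this slide, p = f(W[c1; c2] + b), can be sketched as follows; the shapes (W is n × 2n) follow from the concatenation, and the random weights are placeholders.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Sketch: a parent vector from two children via concatenation,
# an affine map W (n x 2n) plus bias b, and the sigmoid f.
def parent(c1, c2, W, b):
    return sigmoid(W @ np.concatenate([c1, c2]) + b)

n = 3
rng = np.random.default_rng(0)
W = rng.normal(size=(n, 2 * n))
b = np.zeros(n)
p = parent(rng.normal(size=n), rng.normal(size=n), W, b)
```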
- 7. Structure Prediction
  - Build the tree as well as the node representations.
  - Concept:
    - Given a sentence (a sequence of words), generate all possible trees over those words.
    - For each tree, learn autoencoders at each non-leaf node and compute the reconstruction error.
    - The total reconstruction error of a tree is the sum of the reconstruction errors at each node.
    - Select the tree with the minimum total reconstruction error.
  - Too expensive (slow) in practice!
- 8. Greedy Unsupervised RAE
  - For each pair of consecutive nodes, compute their parent. Given n child nodes at some step, this yields n-1 candidate parents.
  - Select the candidate parent with the minimum reconstruction error, merging that pair and reducing the number of nodes to process as we go up the tree.
  - Repeat this process until we are left with a single parent (the root).
  - Weighted reconstruction: the number of descendants of a parent must be considered when computing its reconstruction error.
  - Parent vectors are L2-normalized (p/||p||) to prevent the hidden layer (W) from shrinking the vectors and thereby trivially reducing the reconstruction error.
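The greedy procedure above can be sketched as follows. This is a simplified illustration, not the paper's implementation: it omits the descendant-count weighting, and the encoder/decoder weights are untrained placeholders.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Greedy sketch: repeatedly merge the adjacent pair with the lowest
# reconstruction error until a single root vector remains.
# W_enc encodes children -> parent; W_dec decodes parent -> children.
def greedy_rae(vectors, W_enc, b_enc, W_dec, b_dec):
    nodes = [v / np.linalg.norm(v) for v in vectors]
    while len(nodes) > 1:
        best = None
        for i in range(len(nodes) - 1):
            c = np.concatenate([nodes[i], nodes[i + 1]])
            p = sigmoid(W_enc @ c + b_enc)
            p = p / np.linalg.norm(p)                    # L2-normalize the parent
            err = np.sum((W_dec @ p + b_dec - c) ** 2)   # reconstruction error
            if best is None or err < best[0]:
                best = (err, i, p)
        _, i, p = best
        nodes[i:i + 2] = [p]                             # merge the best pair
    return nodes[0]

n = 3
rng = np.random.default_rng(0)
W_enc = rng.normal(size=(n, 2 * n)); b_enc = np.zeros(n)
W_dec = rng.normal(size=(2 * n, n)); b_dec = np.zeros(2 * n)
root = greedy_rae([rng.normal(size=n) for _ in range(4)],
                  W_enc, b_enc, W_dec, b_dec)
```

For a sentence of m words this does O(m) merge rounds of O(m) candidates each, versus the exponential number of trees considered by exhaustive search.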
- 9. Semi-Supervised RAE
  - Each parent node is assigned a 2n-dimensional feature vector.
  - We can learn a weight vector (2n-dimensional) on top of those parent vectors.
  - A logistic sigmoid function is used as the output.
  - The sum of cross-entropy losses is minimized over all parent nodes in the tree.
  - Similar to logistic regression.
- 10. Equations
  - Sigmoid output at a parent node p.
  - Cross-entropy loss.
  - Aggregate the loss over all (sentence x, label t) pairs in the corpus.
  - The total loss of a tree is the sum of the losses at all parent nodes.
  - The loss at a parent node comes from two sources: reconstruction error + cross-entropy loss.
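The supervised part of the loss can be sketched as below. The equations themselves were images on the original slide, so this is a reconstruction under stated assumptions: a softmax over the five sentiment categories applied to a parent vector, with cross-entropy against a target distribution t; W_label, the dimensions, and the example target are all illustrative.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())   # subtract max for numerical stability
    return e / e.sum()

def cross_entropy(pred, t):
    # Cross-entropy between target distribution t and prediction pred.
    return -np.sum(t * np.log(pred + 1e-12))

K, n = 5, 3                                  # 5 sentiment categories, n-dim parent vectors
rng = np.random.default_rng(0)
W_label = rng.normal(size=(K, n))            # hypothetical label-layer weights
p = rng.normal(size=n)                       # a parent node's vector
d = softmax(W_label @ p)                     # predicted sentiment distribution
t = np.array([0.1, 0.2, 0.4, 0.2, 0.1])     # example target distribution
loss = cross_entropy(d, t)
```

The full objective sums this cross-entropy term and the reconstruction error over every parent node in every sentence's tree.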
- 11. Learning
  - Compute the gradient w.r.t. the parameters (Ws, b) and use L-BFGS.
  - The objective function is not convex.
  - Only a local optimum can be reached with this procedure.
  - Works well in practice.
- 12. EP Dataset
  - Experience Project (EP) dataset:
    - People write confessions and others tag them.
    - Five categories.
- 13. Baselines
  - Random (20%).
  - Most frequent (38.1%): always predict "I understand".
  - Binary bag of words (46.4%): represent each sentence as a binary bag of words; no pre-trained representations.
  - Features (47.0%): use sentiment lexicons and training data (supervised sentiment classification with SVMs).
  - Word vectors (45.5%): ignore the tree structure learnt by the RAE; aggregate the pre-trained representations of each word in the sentence and train an SVM.
  - Proposed method (50.1%):
    - Learn the tree structure with the RAE using unlabeled data (sentences only).
    - A softmax layer is trained on each parent node using labeled data.
- 14. Predicting the Distribution
- 15. Random Vectors!!!
  - Randomly initializing the word vectors does very well!
  - Because supervision occurs later, it is fine to initialize randomly.
  - Randomly initialized RAEs have shown similarly good performance in other tasks:
    - "On Random Weights and Unsupervised Feature Learning", ICML 2011.
- 16. Summary
  - Learn a sentiment classifier using recursive autoencoders.
  - The tree structure is also learnt, using unlabeled data and a greedy algorithm for tree construction.
  - Semi-supervision at the parent-node level.
  - The model is general enough for other sentence-level classification tasks.
  - Random word representations do very well!
