# Label propagation - Semisupervised Learning with Applications to NLP


### Transcript

• 1. Label Propagation. Seminar: Semi-supervised and Unsupervised Learning with Applications to NLP. David Przybilla, davida@coli.uni-saarland.de
• 2. Outline ● What is Label Propagation ● The Algorithm ● The motivation behind the algorithm ● Parameters of Label Propagation ● Relation Extraction with Label Propagation
• 3. Label Propagation ● Semi-supervised ● Shows good results when the amount of annotated data is low compared to supervised alternatives ● Similar to kNN
• 4. K-Nearest Neighbors (kNN) ● Shares similar ideas with Label Propagation ● Unlike kNN, Label Propagation (LP) uses the unlabeled instances during the process of finding the labels
• 5. Idea of the Problem ● Nearby instances should have similar labels. ● L = set of labeled instances, U = set of unlabeled instances. ● We want to find a labeling function f over L ∪ U that assigns similar labels to similar instances.
• 6. The Model ● A complete graph ● Each node is an instance ● Each arc has a weight T_xy ● T_xy is high if nodes x and y are similar
• 7. The Model ● Inside a node: soft labels, i.e. a probability distribution over the possible classes
• 8. Variables – Model ● T is an n×n matrix holding all the weights of the graph. With N_1 … N_l the labeled instances and N_{l+1} … N_n the unlabeled ones, T splits into four blocks: T_ll, T_lu, T_ul, T_uu.
• 9. Variables – Model ● Y is an n×k matrix holding the soft probabilities of each instance: Y_{N_a, R_b} is the probability of instance N_a being labeled R_b. Y splits into Y_L (labeled rows) and Y_U (unlabeled rows); the problem to solve is finding Y_U. R_1, R_2 … R_k are the possible labels and N_1, N_2 … N_n the instances to label.
• 10. Algorithm ● Y will change in each iteration
• 11. How to measure T? ● Use a distance measure, e.g. Euclidean distance, turned into a similarity: T_xy = exp(−d(x, y)² / σ²) ● σ is an important parameter; ignore it at the moment, we will talk about it later
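The weight computation described above can be sketched in Python (a minimal version, assuming a Gaussian kernel over Euclidean distances; the function name and `sigma` default are illustrative):

```python
import numpy as np

def weight_matrix(X, sigma=1.0):
    """Build T with T[x, y] = exp(-||x - y||^2 / sigma^2): high for similar nodes."""
    # Pairwise squared Euclidean distances between all instances.
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dists / sigma ** 2)
```

Identical points get weight 1; distant points get weight close to 0, so similarity falls off smoothly with distance.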
• 12. How to initialize Y⁰? ● How to correctly set the values of Y⁰? ● Fill in the known values (of the labeled data) ● How to fill in the values of the unlabeled data? → The initialization of these values can be arbitrary ● Transform T into T̄ (row normalization): T̄_ij = T_ij / Σ_k T_ik
• 13. Propagation Step ● During the process Y will change: Y⁰ → Y¹ → … → Yᵏ ● Update Y during each iteration: Y ← T̄ Y
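The propagation loop above can be sketched as follows (a minimal version, assuming the labeled instances occupy the first rows of `T`; the labeled rows are clamped back to their known values after every update, as the convergence slides describe):

```python
import numpy as np

def label_propagation(T, Y_l, n_iter=1000, tol=1e-6):
    """Iterate Y <- T_bar @ Y, clamping the labeled rows, until convergence.

    T   : (n, n) similarity matrix, labeled instances in the first l rows.
    Y_l : (l, k) one-hot labels of the labeled instances.
    Returns the (n - l, k) soft labels of the unlabeled instances.
    """
    n, l, k = T.shape[0], Y_l.shape[0], Y_l.shape[1]
    T_bar = T / T.sum(axis=1, keepdims=True)   # row normalization
    Y = np.full((n, k), 1.0 / k)               # arbitrary init for unlabeled rows
    Y[:l] = Y_l                                # fill in the known values
    for _ in range(n_iter):
        Y_new = T_bar @ Y                      # propagation step
        Y_new[:l] = Y_l                        # clamp the labeled rows
        if np.abs(Y_new - Y).max() < tol:
            return Y_new[l:]
        Y = Y_new
    return Y[l:]
```

Each row of the returned matrix is the soft label distribution of one unlabeled instance.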
• 14. Convergence ● During the iteration, Y_L is clamped (reset to the known labels), so only Y_U changes: Y_U ← T̄_uu Y_U + T̄_ul Y_L. Assuming we iterate infinitely many times: Y_U¹ = T̄_uu Y_U⁰ + T̄_ul Y_L; Y_U² = T̄_uu (T̄_uu Y_U⁰ + T̄_ul Y_L) + T̄_ul Y_L; …
• 15. Convergence ● Since T̄ is row-normalized and T̄_uu is a submatrix of T̄, its row sums are strictly less than 1, so after n iterations the term T̄_uu^n Y_U⁰ converges to zero: the initialization of the unlabeled part does not matter.
• 16. After Convergence ● After convergence one can find Y_U directly by solving: Y_U = (I − T̄_uu)⁻¹ T̄_ul Y_L
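The fixed point can thus be computed directly, without iterating (a sketch consistent with the formula above; solving the linear system is preferable to forming the inverse explicitly):

```python
import numpy as np

def lp_closed_form(T, Y_l):
    """Solve Y_U = (I - T_bar_uu)^(-1) T_bar_ul Y_L directly.

    T   : (n, n) similarity matrix, labeled instances in the first l rows.
    Y_l : (l, k) one-hot labels of the labeled instances.
    """
    l = Y_l.shape[0]
    T_bar = T / T.sum(axis=1, keepdims=True)   # row normalization
    T_uu = T_bar[l:, l:]
    T_ul = T_bar[l:, :l]
    n_u = T_uu.shape[0]
    # Solve (I - T_uu) Y_U = T_ul Y_L instead of inverting.
    return np.linalg.solve(np.eye(n_u) - T_uu, T_ul @ Y_l)
```

The result should match what the iterative propagation converges to.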
• 17. Optimization Problem ● w_ij: similarity between i and j ● f should minimize the energy function E(f) = ½ Σ_{i,j} w_ij (f(i) − f(j))² ● f(i) and f(j) should be similar for a high w_ij in order to minimize the energy
• 18. The Graph Laplacian ● Let D be the diagonal degree matrix, D_ii = Σ_j T̄_ij; since the rows of T̄ are normalized, D = I ● The graph Laplacian is defined as Δ = D − T̄ = I − T̄ ● Since f: V → R, the Laplacian can act on f, so the energy function can be rewritten in terms of Δ: E(f) = fᵀ Δ f
• 19. Back to the Optimization Problem ● The energy can be rewritten using the Laplacian; f should minimize it. In block form: Δ_uu = D_uu − T̄_uu = I − T̄_uu, and Δ_ul = D_ul − T̄_ul = −T̄_ul (D is diagonal, so D_ul = 0).
• 20. Optimization Problem ● With Δ_uu = I − T̄_uu and Δ_ul = −T̄_ul, minimizing the energy yields f_u = (I − T̄_uu)⁻¹ T̄_ul f_l ● So the algorithm converges to the minimizer of the energy function
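The step the slide leaves implicit is setting the gradient of the energy to zero over the unlabeled block (with f_l clamped), which recovers exactly the fixed point of the propagation:

```latex
% Minimize E(f) = f^\top \Delta f over f_u, with f_l fixed:
\frac{\partial E}{\partial f_u} = 0
\;\Rightarrow\; \Delta_{uu} f_u + \Delta_{ul} f_l = 0
\;\Rightarrow\; f_u = -\Delta_{uu}^{-1} \Delta_{ul} f_l
            = (I - \bar{T}_{uu})^{-1} \bar{T}_{ul} f_l
```

Substituting Δ_uu = I − T̄_uu and Δ_ul = −T̄_ul in the last step gives the same closed form as the convergence analysis.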
• 21. Sigma Parameter ● Remember the σ parameter? ● It strongly influences the behavior of LP ● There can be: just one σ for the whole feature vector, or one σ per dimension
• 22. Sigma Parameter ● What happens as σ tends to: – 0: the label of an unknown instance is given by just the nearest labeled instance – infinity: all unlabeled instances receive the same influence from all labeled instances; the soft probabilities of each unlabeled instance are then given by the class frequencies in the labeled data ● There are heuristics for finding an appropriate value of σ
• 23. Sigma Parameter – MST ● Build a minimum spanning tree over the instances ● Take the minimum-weight arc connecting two components with different labels (e.g. Label1 vs. Label2) and set σ = min weight(arc) / 3
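This heuristic can be sketched with Kruskal's algorithm: process all pairwise distances in increasing order with a union-find; the first edge that would merge two components carrying different labels is the minimum such tree arc, and dividing by 3 follows the 3σ rule so the Gaussian weight across that arc is already small. (A minimal sketch; `labels` uses `-1` for unlabeled points, and the function name is illustrative.)

```python
import numpy as np

def mst_sigma(X, labels):
    """sigma = d_min / 3, where d_min is the weight of the first spanning-tree
    edge (in Kruskal order) connecting components with different labels."""
    n = len(X)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Label set carried by each component (unlabeled points contribute nothing).
    comp_labels = [{labels[i]} - {-1} for i in range(n)]
    dists = (((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)) ** 0.5
    edges = sorted((dists[i, j], i, j) for i in range(n) for j in range(i + 1, n))
    for d, i, j in edges:
        ri, rj = find(i), find(j)
        if ri == rj:
            continue  # not a tree edge
        # Tree edge: does it connect components with conflicting labels?
        if comp_labels[ri] and comp_labels[rj] and comp_labels[ri] != comp_labels[rj]:
            return d / 3.0
        parent[ri] = rj
        comp_labels[rj] |= comp_labels[ri]
    return None  # no pair of differently labeled components
```

Because edges are processed in increasing weight, every merge is a spanning-tree edge, so the first conflicting merge is the minimal arc between differently labeled components.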
• 24. Sigma Parameter – Learning It ● How to learn σ? Assumption: a good σ will do classification with confidence and thus minimize entropy ● How to do it? Smooth the transition matrix T and find the derivative of H (the entropy) w.r.t. σ ● When to do it? When using one σ per dimension, this can be used to determine irrelevant dimensions
• 25. Labeling Approach ● Once Y_U is estimated, how do we assign labels to the instances? ● Take the most likely class ● Class mass normalization ● Label bidding
• 26. Labeling Approach ● Take the most likely class: simply look at the rows of Y_U and choose for each instance the label with the highest probability ● Problem: no control over the proportion of classes
• 27. Labeling Approach ● Class mass normalization ● Given some class proportions P_1, P_2 … P_k ● Scale each column c of Y_U so that its mass matches P_c ● Then simply look at the rows of Y_U and choose for each instance the label with the highest probability
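Class mass normalization can be sketched in a few lines: rescale each column so its total mass matches the target proportion, then take the row-wise argmax (a minimal sketch; `proportions` holds the target class fractions P_1 … P_k):

```python
import numpy as np

def class_mass_normalization(Y_u, proportions):
    """Scale column c of Y_u so its mass matches P_c, then pick the best label per row."""
    Y = np.asarray(Y_u, dtype=float)
    # Current mass of each class, and the factor that moves it to the target.
    scale = np.asarray(proportions, dtype=float) / Y.sum(axis=0)
    return np.argmax(Y * scale, axis=1)
```

Unlike the plain argmax, this can flip borderline instances toward under-represented classes so the output respects the given proportions.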
• 28. Labeling Approach ● Label bidding ● Given some class proportions P_1, P_2 … P_k: 1. Estimate the number of items per label (C_k). 2. Choose the label with the greatest number of items; take the C_k items whose probability of being that label is highest and assign them that label. 3. Iterate through all the possible labels.
• 29. Experiment Setup ● Artificial data: comparison of LP vs. kNN (k=1) ● Character recognition: recognize handwritten digits ● Images of 16×16 pixels, grayscale, i.e. 256-dimensional vectors ● Recognizing the digits 1, 2, 3
• 30. Results using LP on artificial data
• 31. Results using LP on artificial data ● LP finds the structure in the data while kNN fails
• 32. P1NN ● P1NN is a baseline for comparisons, a simplified version of LP: 1. During each iteration, find the unlabeled instance nearest to a labeled instance and label it. 2. Iterate until all instances are labeled.
• 33. Results using LP on the handwritten data set ● P1NN (baseline), 1NN (kNN) ● CNe: class mass normalization, proportions estimated from the labeled data ● LBo: label bidding with oracle class proportions ● ML: most likely labels
• 34. Relation Extraction? ● From natural language texts, detect semantic relations among entities. Example: "B. Gates married Melinda French on January 1, 1994" → spouse(B. Gates, Melinda French)
• 35. Why LP to do RE? Problems with the alternatives: supervised approaches need a lot of annotated data; unsupervised approaches retrieve clusters of relations with no label.
• 36. RE – Problem Definition ● Find an appropriate label for an occurrence of two entities in a context. Example: "… B. Gates married Melinda French on January 1, 1994", split into context before (Cpre), entity 1 (e1), context in the middle (Cmid), entity 2 (e2), and context after (Cpos) ● Idea: if two occurrences of entity pairs have similar contexts, then they have the same relation type
• 37. RE Problem Definition – Features ● Words in the contexts ● Entity types: Person, Location, Org… ● POS tags of the words in the contexts ● Chunking tags: mark which words in the contexts are inside chunks ● Grammatical function of the words in the contexts, e.g. NP-SBJ (subject) ● Position of words: first word of e1, whether there is any word in Cmid, first word in Cpre/Cmid/Cpos…, second word of e1, second word in Cpre…
• 38. RE Problem Definition – Labels
• 39. Experiment ● ACE 2003 data, a corpus of newspaper text ● Assume all entities have already been identified ● Comparison between: – different amounts of labeled samples: 1%, 10%, 25%, 50%, 75%, 100% – different similarity functions – LP, SVM and bootstrapping ● LP settings: similarity function: cosine, Jensen-Shannon; labeling approach: take the most likely class; σ: average similarity between labeled classes
• 40. Experiment ● Jensen-Shannon (JS) similarity measure: measures the distance between two probability distributions ● JS is a smoothed version of the Kullback-Leibler divergence D_KL ● The Kullback-Leibler divergence is not symmetric and does not always have a finite value
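The Jensen-Shannon divergence can be sketched as the symmetrized KL divergence to the midpoint distribution (a minimal version with base-2 logs, assuming the inputs are already normalized probability vectors):

```python
import numpy as np

def kl(p, q):
    """Kullback-Leibler divergence D_KL(p || q): not symmetric, infinite if q=0 where p>0."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

def jensen_shannon(p, q):
    """JS(p, q) = 0.5*KL(p||m) + 0.5*KL(q||m), with m = (p+q)/2.

    Mixing with m smooths away zeros, so JS is always finite, and the
    averaging makes it symmetric, fixing both problems of plain KL."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

With base-2 logs, JS ranges from 0 (identical distributions) to 1 (disjoint supports), which makes it convenient as a context-similarity measure.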
• 41. Results
• 42. Classifying relation subtypes: SVM vs. LP (SVM with a linear kernel)
• 43. Bootstrapping ● Start from a set of seeds, train a classifier on them, and update the set of seeds with the instances whose classification confidence is high enough
• 44. Classifying relation types: Bootstrapping vs. LP, starting with 100 random seeds
• 45. Results ● LP performs well in general when there is little annotated data, compared to SVM and kNN ● Irrelevant dimensions can be identified by using LP ● Looking at the structure of the unlabeled data helps when there is little annotated data
• 46. Thank you