Label Propagation

Published in:
Education


- 1. Label Propagation. Seminar: Semi-supervised and Unsupervised Learning with Applications to NLP. David Przybilla, davida@coli.uni-saarland.de
- 2. Outline ● What is Label Propagation ● The algorithm ● The motivation behind the algorithm ● Parameters of Label Propagation ● Relation Extraction with Label Propagation
- 3. Label Propagation ● Semi-supervised ● Shows good results, compared with supervised alternatives, when the amount of annotated data is low ● Similar to kNN
- 4. K-Nearest Neighbors (kNN) ● Shares similar ideas with Label Propagation ● Unlike kNN, Label Propagation (LP) also uses the unlabeled instances while finding the labels
- 5. Idea of the Problem: similar, nearby unlabeled instances should have similar labels. L = set of labeled instances, U = set of unlabeled instances. We want to find a function f such that:
- 6. The Model ● A complete graph ● Each node is an instance ● Each arc has a weight $T_{xy}$ ● $T_{xy}$ is high if nodes x and y are similar.
- 7. The Model ● Inside a node: soft labels
- 8. Variables - Model ● T is a matrix holding all the weights of the graph ● $N_1, \dots, N_l$ = labeled data ● $N_{l+1}, \dots, N_n$ = unlabeled data ● T is partitioned into blocks: $T = \begin{pmatrix} T_{ll} & T_{lu} \\ T_{ul} & T_{uu} \end{pmatrix}$
- 9. Variables - Model ● Y is a matrix holding the soft probabilities of each instance ● $Y_{N_a, R_b}$ is the probability of instance $N_a$ being labeled as $R_b$ ● $R_1, R_2, \dots, R_k$: each of the possible labels ● $N_1, N_2, \dots, N_n$: each of the instances to label ● $Y = \begin{pmatrix} Y_L \\ Y_U \end{pmatrix}$, where $Y_U$ is the problem to solve
- 10. Algorithm: Y will change in each iteration
- 11. How to Measure T? ● Distance measure: Euclidean distance ● Important parameter: σ (ignore it for the moment, we will talk about it later)
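
The weight formula on this slide did not survive in the transcript. As a minimal sketch, assume the Gaussian form commonly used with label propagation, $T_{xy} = \exp(-d(x, y)^2 / \sigma^2)$ with $d$ the Euclidean distance. The function names and the sample points below are hypothetical.

```python
import math

def euclidean(x, y):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def gaussian_weight(x, y, sigma):
    """Assumed weight form: exp(-d(x, y)^2 / sigma^2). Close points get weights near 1."""
    d = euclidean(x, y)
    return math.exp(-(d ** 2) / (sigma ** 2))

def weight_matrix(points, sigma):
    """Weight matrix T of the complete graph over `points`."""
    return [[gaussian_weight(p, q, sigma) for q in points] for p in points]

points = [(0.0, 0.0), (0.0, 1.0), (3.0, 4.0)]
T = weight_matrix(points, sigma=1.0)
```

With this choice, nearby instances get a weight close to 1 and distant ones a weight close to 0, which matches the requirement that the weight is high when two nodes are similar.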
- 12. How to Initialize $Y^0$? ● How to correctly set the values of $Y^0$? ● Fill in the known values (of the labeled data) ● How to fill in the values of the unlabeled data? → The initialization of these values can be arbitrary. ● Transform $T$ into $\bar{T}$ (row normalization)
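
The initialization can be sketched as follows. The one-hot encoding of known labels and the uniform rows for unlabeled instances are illustrative assumptions (the slide only says the unlabeled values can be arbitrary), and `row_normalize` and `init_labels` are hypothetical names.

```python
def row_normalize(T):
    """Return T-bar: each row of T divided by its sum, so every row sums to 1."""
    return [[w / sum(row) for w in row] for row in T]

def init_labels(labels, n_classes):
    """Build Y0: one-hot rows for labeled instances (label is an int),
    uniform rows for unlabeled instances (label is None)."""
    Y0 = []
    for label in labels:
        if label is None:
            Y0.append([1.0 / n_classes] * n_classes)  # arbitrary starting value
        else:
            Y0.append([1.0 if c == label else 0.0 for c in range(n_classes)])
    return Y0

T = [[1.0, 0.5, 0.1], [0.5, 1.0, 0.4], [0.1, 0.4, 1.0]]
T_bar = row_normalize(T)
Y0 = init_labels([0, None, 1], n_classes=2)
```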
- 13. Propagation Step ● During the process Y will change: $Y^0 \to Y^1 \to \dots \to Y^k$ ● Update Y during each iteration
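- 14. Convergence ● During the iteration, with $Y_L$ clamped: $\begin{pmatrix} Y_L \\ Y_U \end{pmatrix} = \begin{pmatrix} \bar{T}_{ll} & \bar{T}_{lu} \\ \bar{T}_{ul} & \bar{T}_{uu} \end{pmatrix} \begin{pmatrix} Y_L \\ Y_U \end{pmatrix}$ ● Assuming we iterate infinitely many times, then: $Y_U^1 = \bar{T}_{uu} Y_U^0 + \bar{T}_{ul} Y_L$, $Y_U^2 = \bar{T}_{uu} (\bar{T}_{uu} Y_U^0 + \bar{T}_{ul} Y_L) + \bar{T}_{ul} Y_L$, ...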
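
The clamped iteration can be sketched in a few lines of plain Python. This is an illustration under assumptions, not the author's implementation: labeled instances are listed first, the weights use an assumed Gaussian form with σ = 1, and `propagate` is a hypothetical name.

```python
import math

def propagate(T_bar, Y0, n_labeled, iterations=200):
    """Iterate Y <- T_bar * Y, resetting (clamping) the first n_labeled rows to Y0."""
    n, k = len(Y0), len(Y0[0])
    Y = [row[:] for row in Y0]
    for _ in range(iterations):
        Y_new = [[sum(T_bar[i][j] * Y[j][c] for j in range(n)) for c in range(k)]
                 for i in range(n)]
        for i in range(n_labeled):
            Y_new[i] = Y0[i][:]  # clamp the labeled rows
        Y = Y_new
    return Y

# Four points on a line; the first two are labeled (class 0 at x=0, class 1 at x=3).
xs = [0.0, 3.0, 1.0, 2.0]
W = [[math.exp(-(a - b) ** 2) for b in xs] for a in xs]
T_bar = [[w / sum(row) for w in row] for row in W]
Y0 = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [0.5, 0.5]]
Y = propagate(T_bar, Y0, n_labeled=2)
```

In this toy setup the unlabeled point at x=1 ends up leaning toward class 0 and the point at x=2 toward class 1, while the labeled rows stay fixed.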
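- 15. Convergence ● Since $\bar{T}$ is row-normalized and $\bar{T}_{uu}$ is a submatrix of $\bar{T}$, each row of $\bar{T}_{uu}$ sums to less than 1 when all weights are positive ● Doing it n times will lead to: $Y_U^n = \bar{T}_{uu}^{\,n} Y_U^0 + \left( \sum_{i=1}^{n} \bar{T}_{uu}^{\,i-1} \right) \bar{T}_{ul} Y_L$, where the term $\bar{T}_{uu}^{\,n} Y_U^0$ converges to zero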
- 16. After convergence one can find $Y_U$ by solving: $Y_U = \bar{T}_{uu} Y_U + \bar{T}_{ul} Y_L$, that is, $Y_U = (I - \bar{T}_{uu})^{-1} \bar{T}_{ul} Y_L$
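
As a hedged sketch, the fixed point can also be computed directly by solving the linear system $(I - \bar{T}_{uu}) Y_U = \bar{T}_{ul} Y_L$ instead of iterating. The small Gauss-Jordan solver, the function names, and the toy data (Gaussian weights with σ = 1, labeled instances first) are all assumptions made for illustration.

```python
import math

def solve(A, B):
    """Solve A X = B by Gauss-Jordan elimination with partial pivoting.
    A is n x n and B is n x k (lists of lists); neither is modified."""
    n = len(A)
    M = [A[i][:] + B[i][:] for i in range(n)]  # augmented matrix [A | B]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        p = M[col][col]
        M[col] = [v / p for v in M[col]]
        for r in range(n):
            if r != col:
                factor = M[r][col]
                M[r] = [a - factor * b for a, b in zip(M[r], M[col])]
    return [row[n:] for row in M]

def closed_form(T_bar, Y_L):
    """Y_U = (I - T_uu)^{-1} T_ul Y_L, with the labeled instances listed first."""
    l, n, k = len(Y_L), len(T_bar), len(Y_L[0])
    A = [[(1.0 if i == j else 0.0) - T_bar[i][j] for j in range(l, n)]
         for i in range(l, n)]
    B = [[sum(T_bar[i][j] * Y_L[j][c] for j in range(l)) for c in range(k)]
         for i in range(l, n)]
    return solve(A, B)

xs = [0.0, 3.0, 1.0, 2.0]  # first two points are labeled
W = [[math.exp(-(a - b) ** 2) for b in xs] for a in xs]
T_bar = [[w / sum(row) for w in row] for row in W]
Y_U = closed_form(T_bar, [[1.0, 0.0], [0.0, 1.0]])
```

For large graphs a library linear solver would be the usual choice; the hand-written elimination here only keeps the example dependency-free.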
- 17. Optimization Problem ● $w_{ij}$: similarity between i and j ● f should minimize the energy function ● f(i) and f(j) should be similar for a high $w_{ij}$ in order to minimize the energy
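
The energy formula itself is missing from the transcript. A minimal sketch, assuming the usual quadratic form $E(f) = \frac{1}{2} \sum_{i,j} w_{ij} (f(i) - f(j))^2$, shows why a labeling that agrees across high-weight arcs has lower energy. The weights and labelings below are made up for illustration.

```python
def energy(W, f):
    """Assumed energy: 1/2 * sum over i, j of W[i][j] * (f[i] - f[j])^2."""
    n = len(f)
    return 0.5 * sum(W[i][j] * (f[i] - f[j]) ** 2
                     for i in range(n) for j in range(n))

# Nodes 0 and 1 are very similar (weight 0.9); node 2 is far from both.
W = [[0.0, 0.9, 0.1],
     [0.9, 0.0, 0.1],
     [0.1, 0.1, 0.0]]
smooth = energy(W, [1.0, 1.0, 0.0])  # similar nodes 0 and 1 agree
rough = energy(W, [1.0, 0.0, 0.0])   # similar nodes 0 and 1 disagree
```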
- 18. The Graph Laplacian ● Let D be a diagonal matrix where $D_{ii} = \sum_j \bar{T}_{ij}$ ● Rows are normalized, so $D = I$ ● The graph Laplacian is defined as $\Delta = D - \bar{T}$ ● Since $f: V \to \mathbb{R}$, we can use the graph Laplacian to act on it ● So the energy function can be rewritten in terms of $\Delta$
- 19. Back to the Optimization Problem ● The energy can be rewritten using the Laplacian ● f should minimize the energy function ● $\Delta_{uu} = D_{uu} - \bar{T}_{uu} = I - \bar{T}_{uu}$ ● $\Delta_{ul} = D_{ul} - \bar{T}_{ul} = -\bar{T}_{ul}$
- 20. Optimization Problem ● $\Delta$ can be rewritten in terms of $\bar{T}$: $\Delta_{uu} = D_{uu} - \bar{T}_{uu} = I - \bar{T}_{uu}$ and $\Delta_{ul} = D_{ul} - \bar{T}_{ul} = -\bar{T}_{ul}$ ● $f_u = (I - \bar{T}_{uu})^{-1} \bar{T}_{ul} f_l$ ● The algorithm converges to the minimizer of the energy function
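
The step from the energy function to the closed-form solution is compressed on the slides. The derivation below is a sketch under assumptions: it uses symmetric, unnormalized weights $W = (w_{ij})$ with degree matrix $D_W = \mathrm{diag}(\sum_j w_{ij})$ (a symbol introduced here, distinct from the slides' $D = I$) and the standard quadratic energy, with $\bar{T} = D_W^{-1} W$.

```latex
\begin{aligned}
E(f) &= \tfrac{1}{2} \sum_{i,j} w_{ij} \bigl(f(i) - f(j)\bigr)^2 = f^\top (D_W - W)\, f \\[4pt]
\frac{\partial E}{\partial f_u} = 0
  &\;\Longrightarrow\; (D_{W,uu} - W_{uu})\, f_u = W_{ul}\, f_l \\[4pt]
  &\;\Longrightarrow\; (I - \bar{T}_{uu})\, f_u = \bar{T}_{ul}\, f_l
     \qquad \text{(multiply on the left by } D_{W,uu}^{-1}\text{)} \\[4pt]
  &\;\Longrightarrow\; f_u = (I - \bar{T}_{uu})^{-1}\, \bar{T}_{ul}\, f_l
\end{aligned}
```

The last line has the same form as the fixed point of the propagation iteration, which is why the algorithm's limit coincides with the energy minimizer.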
- 21. Sigma Parameter ● Remember the sigma parameter? ● It strongly influences the behavior of LP ● There can be: just one σ for the whole feature vector, or one σ per dimension
- 22. Sigma Parameter ● What happens if σ tends to: – 0: the label of an unlabeled instance is given by just the nearest labeled instance – Infinity: all the unlabeled instances receive the same influence from all labeled instances, so the soft probabilities of each unlabeled instance are given by the class frequencies in the labeled data ● There are heuristics for finding an appropriate value of sigma
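
The two extremes can be illustrated with a simplified, one-step view: the row-normalized weights from a single unlabeled point to the labeled points only. This is not the full propagation, and it assumes Gaussian weights of the form $\exp(-d^2 / \sigma^2)$; the function name and data are hypothetical.

```python
import math

def normalized_weights(x, labeled_points, sigma):
    """Row-normalized Gaussian weights from x to each labeled point."""
    d2 = [sum((a - b) ** 2 for a, b in zip(x, p)) for p in labeled_points]
    # Subtract the smallest squared distance before exponentiating so that a
    # very small sigma does not underflow every weight to 0.
    m = min(d2)
    w = [math.exp(-(d - m) / sigma ** 2) for d in d2]
    total = sum(w)
    return [v / total for v in w]

labeled = [(0.0,), (1.0,), (5.0,)]
x = (0.4,)
near_zero = normalized_weights(x, labeled, sigma=0.01)   # nearest point dominates
very_large = normalized_weights(x, labeled, sigma=1e6)   # all points weigh the same
```

With a tiny σ all the weight goes to the nearest labeled point; with a huge σ every labeled point contributes equally, so only the class frequencies matter.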
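- 23. Sigma Parameter - MST ● [Figure: minimum spanning tree over the instances, with components labeled Label1 and Label2 and the arc that connects two components with different labels highlighted] ● This is the minimum arc connecting two components with different labels ● $\sigma = \frac{\text{min weight(arc)}}{3}$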
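
One way to sketch this heuristic is a Kruskal-style pass over the arcs sorted by length, stopping at the first arc that would join two components carrying different labels. The sketch assumes arc weight means Euclidean distance; `mst_sigma` and the toy data are hypothetical.

```python
import math
from itertools import combinations

def mst_sigma(points, labels):
    """Return d0 / 3, where d0 is the length of the first MST arc joining two
    components that carry different labels. labels[i] is a class label, or
    None for an unlabeled point."""
    parent = list(range(len(points)))
    comp_label = list(labels)  # label carried by each component root

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    edges = sorted((math.dist(points[i], points[j]), i, j)
                   for i, j in combinations(range(len(points)), 2))
    for d, i, j in edges:
        ri, rj = find(i), find(j)
        if ri == rj:
            continue  # arc would close a cycle, so it is not in the MST
        li, lj = comp_label[ri], comp_label[rj]
        if li is not None and lj is not None and li != lj:
            return d / 3.0
        parent[rj] = ri
        comp_label[ri] = li if li is not None else lj
    return None  # fewer than two distinct labels present

points = [(0.0,), (1.0,), (2.0,), (6.0,), (7.0,)]
labels = ["A", None, None, None, "B"]
sigma = mst_sigma(points, labels)
```

In this example the two clusters are joined by the arc of length 4 between x=2 and x=6, so the heuristic yields σ = 4/3.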
- 24. Sigma Parameter – Learning it ● How to learn sigma? Assumption: a good sigma will classify with confidence and thus minimize entropy ● How to do it? Smooth the transition matrix T, then find the derivative of H (the entropy) with respect to sigma ● When to do it? When using one sigma per dimension; the learned values can then be used to identify irrelevant dimensions
- 25. Labeling Approach ● Once $Y_U$ is computed, how do we assign labels to the instances? ● Take the most likely class ● Class mass normalization ● Label bidding
- 26. Labeling Approach ● Take the most likely class: simply look at the rows of $Y_U$ and choose for each instance the label with the highest probability ● Problem: no control over the proportion of classes
- 27. Labeling Approach ● Class mass normalization ● Given some class proportions $P_1, P_2, \dots, P_k$ ● Scale each column c so that its mass is $P_c$ ● Then simply look at the rows of $Y_U$ and choose for each instance the label with the highest probability
- 28. Labeling Approach ● Label bidding ● Given some class proportions $P_1, P_2, \dots, P_k$: 1. Estimate the number of items per label ($C_k$). 2. Choose the label with the greatest number of items, take the $C_k$ items whose probability of having that label is highest, and assign them that label. 3. Iterate through all the possible labels.
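
The first two labeling approaches can be sketched as below; label bidding is left out because the slides do not spell out its details. Function names and the example matrix are hypothetical, and the sketch assumes every column of $Y_U$ has non-zero mass.

```python
def most_likely(Y_u):
    """For each row of Y_u, pick the index of the largest soft probability."""
    return [max(range(len(row)), key=row.__getitem__) for row in Y_u]

def class_mass_normalize(Y_u, proportions):
    """Rescale each column c so that its total mass equals proportions[c]."""
    k = len(proportions)
    mass = [sum(row[c] for row in Y_u) for c in range(k)]
    return [[row[c] * proportions[c] / mass[c] for c in range(k)] for row in Y_u]

Y_u = [[0.9, 0.1], [0.6, 0.4], [0.55, 0.45]]
plain = most_likely(Y_u)
balanced = most_likely(class_mass_normalize(Y_u, [0.5, 0.5]))
```

Here `plain` assigns every instance to class 0, while `balanced` moves the two less confident instances to class 1 once the columns are rescaled to equal mass, which is the control over class proportions that the plain approach lacks.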
- 29. Experiment Setup ● Artificial data: comparison of LP vs kNN (k=1) ● Character recognition: recognize handwritten digits; images of 16x16 pixels, gray scale; recognizing 1, 2, 3; 256-dimensional vectors
- 30. Results using LP on artificial data
- 31. Results using LP on artificial data ● LP finds the structure in the data while kNN fails
- 32. P1NN ● P1NN is a baseline for comparisons ● Simplified version of LP: 1. During each iteration, find the unlabeled instance nearest to a labeled instance and label it. 2. Iterate until all instances are labeled.
- 33. Results using LP on the handwritten data set ● P1NN (baseline), 1NN (kNN) ● Cne: class mass normalization, with proportions from the labeled data ● Lbo: label bidding with oracle class proportions ● ML: most likely labels
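
A minimal sketch of the P1NN baseline, assuming Euclidean distance and that a newly labeled instance copies the label of its nearest labeled neighbor (the slide does not say which label is assigned). The function name and data are hypothetical.

```python
import math

def p1nn(points, labels):
    """Repeatedly label the unlabeled point that is closest to any labeled point."""
    labels = list(labels)
    while any(l is None for l in labels):
        best = None  # (distance, unlabeled index, label to copy)
        for u, lu in enumerate(labels):
            if lu is not None:
                continue
            for v, lv in enumerate(labels):
                if lv is None:
                    continue
                d = math.dist(points[u], points[v])
                if best is None or d < best[0]:
                    best = (d, u, lv)
        if best is None:
            raise ValueError("at least one labeled point is required")
        labels[best[1]] = best[2]
    return labels

points = [(0.0,), (1.0,), (2.0,), (3.0,), (4.0,), (5.5,)]
result = p1nn(points, ["a", None, None, None, None, "b"])
```

In this example the label "a" spreads along the chain of evenly spaced points, so the point at x=4 receives "a" even though plain 1NN on the original labels would give it "b".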
- 34. Relation Extraction? ● From natural language texts, detect semantic relations among entities ● Example: "B. Gates married Melinda French on January 1, 1994" → spouse(B.Gates, Melinda French)
- 35. Why LP to do RE? Problems: ● Supervised: needs a lot of annotated data ● Unsupervised: retrieves clusters of relations with no label
- 36. RE - Problem Definition ● Find an appropriate label for an occurrence of two entities in a context ● Example: "….. B. Gates married Melinda French on January 1, 1994", where "….." is the preceding context (Cpre), "B. Gates" is entity 1 (e1), "married" is the middle context (Cmid), "Melinda French" is entity 2 (e2), and "on January 1, 1994" is the following context (Cpos) ● Idea: if two occurrences of entity pairs have similar contexts, then they have the same relation type
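
As a small illustration of this decomposition, the sketch below splits a tokenized sentence into the five parts given the token spans of the two entities. The function name, the whitespace tokenization, and the span convention (end index exclusive, e1 before e2) are assumptions.

```python
def split_contexts(tokens, e1_span, e2_span):
    """Split a token list into (Cpre, e1, Cmid, e2, Cpos).
    Spans are (start, end) token indices, end exclusive, with e1 before e2."""
    (s1, t1), (s2, t2) = e1_span, e2_span
    return tokens[:s1], tokens[s1:t1], tokens[t1:s2], tokens[s2:t2], tokens[t2:]

tokens = "B. Gates married Melinda French on January 1 , 1994".split()
c_pre, e1, c_mid, e2, c_pos = split_contexts(tokens, (0, 2), (3, 5))
```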
- 37. RE Problem Definition - Features ● Words: in the contexts ● Entity types: Person, Location, Org... ● POS tagging: of words in the contexts ● Chunking tag: marks which words in the contexts are inside chunks ● Grammatical function of words in the contexts, e.g. NP-SBJ (subject) ● Position of words: first word of e1, second word of e1..., is there any word in Cmid, first word in Cpre, Cmid, Cpost..., second word in Cpre...
- 38. RE problem Definition - Labels
- 39. Experiment ● ACE 2003 data: corpus from newspapers ● Assume all entities have already been identified ● Comparison between: – different amounts of labeled samples: 1%, 10%, 25%, 50%, 75%, 100% – different similarity functions – LP, SVM and bootstrapping ● LP settings: similarity function: cosine, Jensen-Shannon; labeling approach: take the most likely class; sigma: average similarity between labeled classes
- 40. Experiment: Jensen-Shannon ● Similarity measure ● Measures the distance between two probability functions ● JS is a smoothing of the Kullback-Leibler divergence $D_{KL}$ ● Kullback-Leibler divergence: not symmetric, does not always have a finite value
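
The contrast between the two divergences can be sketched with the standard definitions, using base-2 logarithms. How the experiment turns the divergence into a similarity score is not stated on the slides, so that step is omitted here.

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence D_KL(p || q) in bits.
    Infinite when q is 0 somewhere that p is not."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0:
            if qi == 0:
                return math.inf
            total += pi * math.log2(pi / qi)
    return total

def js(p, q):
    """Jensen-Shannon divergence: average KL of p and q to their midpoint m."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p, q = [1.0, 0.0], [0.5, 0.5]
```

For these two distributions `kl(p, q)` is finite while `kl(q, p)` is infinite, whereas `js` is symmetric and stays bounded.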
- 41. Results
- 42. Classifying relation subtypes: SVM vs LP (SVM with linear kernel)
- 43. Bootstrapping: Seeds → Train a classifier → Classifier → Update the set of seeds with instances whose confidence is high enough → (repeat)
- 44. Classifying relation types: bootstrapping vs LP, starting with 100 random seeds
- 45. Results ● Compared with SVM and kNN, LP performs well in general when there is little annotated data ● Irrelevant dimensions can be identified by using LP ● Looking at the structure of unlabeled data helps when there is little annotated data
- 46. Thank you
