Label Propagation
David Przybilla
Jan. 24, 2013

- Label Propagation. Seminar: Semi-Supervised and Unsupervised Learning with Applications to NLP. David Przybilla (davida@coli.uni-saarland.de)
- Outline ● What is Label Propagation ● The Algorithm ● The motivation behind the algorithm ● Parameters of Label Propagation ● Relation Extraction with Label Propagation
- Label Propagation ● Semi-supervised ● Shows good results when the amount of annotated data is small compared to what supervised methods need ● Similar to kNN
- K-Nearest Neighbors (kNN) ● Shares similar ideas with Label Propagation ● Unlike kNN, Label Propagation (LP) also uses the unlabeled instances while inferring the labels
- Idea of the Problem ● Nearby instances should have similar labels ● L = set of labeled instances, U = set of unlabeled instances ● We want to find a function f that assigns a label to every instance in U
- The Model ● A complete graph ● Each node is an instance ● Each arc has a weight T_xy ● T_xy is high if nodes x and y are similar
- The Model ● Inside each node: soft labels (a probability distribution over the possible labels)
- Variables – Model ● T is an n×n matrix holding all the edge weights of the graph ● N_1 … N_l are the labeled instances, N_{l+1} … N_n the unlabeled ones ● Partitioning at index l, T has four blocks: T = [ T_ll T_lu ; T_ul T_uu ]
- Variables – Model ● Y is an n×k matrix holding the soft probabilities of each instance ● Y_{N_a, R_b} is the probability of instance N_a being labeled R_b ● Y splits into Y_L (labeled rows) and Y_U (unlabeled rows) ● R_1, R_2 … R_k are the possible labels; N_1, N_2 … N_n are the instances to label ● The problem to solve: find Y_U
- Algorithm ● Iterative: Y changes in each iteration until it converges
- How to Measure T? ● Use a distance measure (Euclidean distance d): T_xy = exp(−d(x, y)² / σ²) ● σ is an important parameter; ignore it for the moment, we will talk about it later
- How to Initialize Y? ● How to correctly set the values of Y⁰? ● Fill in the known values (of the labeled data) ● How to fill in the values of the unlabeled data? → The initialization of these values can be arbitrary ● Transform T into the row-normalized T̄
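
As a concrete illustration, here is a minimal NumPy sketch of this setup: the Gaussian weights from the previous slide, row normalization, and the initialization of Y⁰. The function name, the convention that labeled instances come first, and the uniform init of the unlabeled rows are my own choices, not from the slides:

```python
import numpy as np

def build_model(X, y_labeled, n_labels, sigma=1.0):
    """X: (n, d) features with the labeled instances first.
    y_labeled: integer labels for the first l rows.
    Returns the row-normalized weight matrix T_bar and the initial Y0."""
    n, l = X.shape[0], len(y_labeled)
    # T_xy = exp(-d(x, y)^2 / sigma^2): high when x and y are similar
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    T = np.exp(-sq_dists / sigma ** 2)
    # Row normalization: each row of T_bar sums to 1
    T_bar = T / T.sum(axis=1, keepdims=True)
    # Y0: one-hot rows for labeled data; unlabeled rows can be arbitrary (uniform here)
    Y0 = np.full((n, n_labels), 1.0 / n_labels)
    Y0[:l] = 0.0
    Y0[np.arange(l), y_labeled] = 1.0
    return T_bar, Y0
```
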
- Propagation Step ● During the process Y will change: Y⁰ → Y¹ → … → Yᵏ ● Update Y in each iteration: Y ← T̄ Y
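
A minimal sketch of the propagation loop, assuming the `T_bar` and `Y0` built above; the clamping of the labeled rows is the "clamped" step of the next slide, and the tolerance-based stopping rule is an assumption of mine:

```python
import numpy as np

def propagate(T_bar, Y0, l, n_iter=1000, tol=1e-6):
    """Iterate Y <- T_bar @ Y, clamping the first l (labeled) rows of Y."""
    Y = Y0.copy()
    for _ in range(n_iter):
        Y_new = T_bar @ Y
        Y_new[:l] = Y0[:l]                 # clamp: labeled rows keep their known labels
        if np.abs(Y_new - Y).max() < tol:  # stop once Y has effectively converged
            return Y_new
        Y = Y_new
    return Y
```
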
- Convergence ● During the iteration Y_L is clamped, so the update acts blockwise: Y_U ← T̄_uu Y_U + T̄_ul Y_L ● Assuming we iterate infinitely many times: Y_U^1 = T̄_uu Y_U^0 + T̄_ul Y_L ; Y_U^2 = T̄_uu (T̄_uu Y_U^0 + T̄_ul Y_L) + T̄_ul Y_L ; …
- Convergence ● Doing it n times leads to: Y_U^n = T̄_uu^n Y_U^0 + (Σ_{i=0}^{n−1} T̄_uu^i) T̄_ul Y_L ● Since T̄ is row-normalized and T̄_uu is a strict submatrix of T̄, its row sums are < 1, so T̄_uu^n converges to zero: the influence of the arbitrary initialization Y_U^0 vanishes
- After Convergence ● After convergence one can find Y_U directly by solving: Y_U = (I − T̄_uu)⁻¹ T̄_ul Y_L
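
The closed-form solution can be computed directly instead of iterating; a sketch reusing the block notation of the earlier slides (partition index l is assumed to separate labeled from unlabeled rows):

```python
import numpy as np

def solve_closed_form(T_bar, Y_L, l):
    """Y_U = (I - T_uu)^(-1) T_ul Y_L, with T_bar partitioned at index l."""
    T_uu = T_bar[l:, l:]
    T_ul = T_bar[l:, :l]
    n_u = T_uu.shape[0]
    # Solve (I - T_uu) Y_U = T_ul Y_L rather than forming the inverse explicitly
    return np.linalg.solve(np.eye(n_u) - T_uu, T_ul @ Y_L)
```
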
- Optimization Problem ● w_ij: similarity between i and j ● f should minimize the energy function E(f) = ½ Σ_{i,j} w_ij (f(i) − f(j))² ● f(i) and f(j) should be similar for a high w_ij in order to minimize E
- The Graph Laplacian ● Let D be a diagonal matrix with D_ii = Σ_j T̄_ij ● Rows are normalized, so D = I ● The graph Laplacian is defined as Δ = D − T̄ = I − T̄ ● Since f: V → ℝ, the Laplacian can act on f, and the energy function can be rewritten in terms of Δ
- Back to the Optimization Problem ● The energy can be rewritten using the Laplacian: E(f) = fᵀ Δ f; f should minimize it ● In block form: Δ_uu = D_uu − T̄_uu = I − T̄_uu and Δ_ul = D_ul − T̄_ul = −T̄_ul (D is diagonal, so D_ul = 0)
- Optimization Problem ● Setting the gradient of E to zero on the unlabeled part and rewriting Δ in terms of T̄ gives the minimizer: f_u = (I − T̄_uu)⁻¹ T̄_ul f_l ● So the algorithm converges to the minimizer of the energy function, the same solution found after convergence
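
Written out, the argument these slides sketch is the standard harmonic-function derivation (same notation as above):

```latex
E(f) = \tfrac{1}{2}\sum_{i,j} w_{ij}\,\bigl(f(i)-f(j)\bigr)^2 = f^{\top}\Delta f,
\qquad \Delta = D - \bar{T} = I - \bar{T}.

\text{Partition } f = (f_l, f_u) \text{ and set } \frac{\partial E}{\partial f_u} = 0:
\quad \Delta_{uu} f_u + \Delta_{ul} f_l = 0
\;\Rightarrow\;
f_u = -\Delta_{uu}^{-1}\,\Delta_{ul}\, f_l
    = (I - \bar{T}_{uu})^{-1}\,\bar{T}_{ul}\, f_l .
```
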
- Sigma Parameter ● Remember the σ parameter? It strongly influences the behavior of LP ● There can be: – just one σ for the whole feature vector – one σ per dimension
- Sigma Parameter ● What happens if σ tends to: – 0: the label of an unknown instance is given by just the nearest labeled instance – Infinity: all the unlabeled instances receive the same influence from all labeled instances; the soft probabilities of each unlabeled instance are given by the class frequencies in the labeled data ● There are heuristics for finding an appropriate value of σ
- Sigma Parameter – MST ● Build a minimum spanning tree over the instances ● Find the minimum-weight arc connecting two components with different labels (Label1 vs. Label2) ● Set σ = min_weight(arc) / 3, following the 3σ rule
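
A sketch of this MST heuristic under the same "labeled instances first" convention; it uses scipy's minimum spanning tree plus a small union-find, and sets σ to a third of the first tree edge joining two components that hold different labels. Function and variable names are mine:

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform

def sigma_from_mst(X, y_labeled):
    """X: (n, d) features, labeled instances first; y_labeled: their labels."""
    n, l = len(X), len(y_labeled)
    dists = squareform(pdist(X))                      # pairwise Euclidean distances
    mst = minimum_spanning_tree(dists).tocoo()
    edges = sorted(zip(mst.data, mst.row, mst.col))   # tree edges, shortest first

    parent = list(range(n))
    def find(a):                                      # union-find root lookup
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    # Labels seen so far in each component (empty set = no labeled point yet)
    labels = [{y_labeled[i]} if i < l else set() for i in range(n)]
    for d, a, b in edges:
        ra, rb = find(a), find(b)
        if labels[ra] and labels[rb] and labels[ra] != labels[rb]:
            return d / 3.0                            # 3-sigma rule: sigma = d_min / 3
        parent[ra] = rb
        labels[rb] = labels[ra] | labels[rb]
    return None                                       # only one label present
```
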
- Sigma Parameter – Learning It ● How to learn σ? Assumption: a good σ will classify with confidence and thus minimize entropy ● How to do it? Smooth the transition matrix T and find the derivative of H (the entropy) w.r.t. σ ● When to do it? When using one σ per dimension, learning them can also identify irrelevant dimensions
- Labeling Approach ● Once Y_U is computed, how do we assign labels to the instances? ● Take the most likely class ● Class mass normalization ● Label bidding
- Labeling Approach ● Take the most likely class ● Simply look at the rows of Y_U and choose, for each instance, the label with the highest probability ● Problem: no control over the proportion of classes
- Labeling Approach ● Class mass normalization ● Given some class proportions P_1, P_2 … P_k ● Scale each column c so its mass matches P_c ● Then simply look at the rows of Y_U and choose, for each instance, the label with the highest probability
- Labeling Approach ● Label bidding ● Given some class proportions P_1, P_2 … P_k 1. Estimate the number of items per label (C_k) 2. Choose the label with the greatest number of items; take the C_k items whose probability of being that label is highest and assign them the label 3. Iterate through all the possible labels
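
Sketches of the first two labeling rules (the class proportions are given as input; names are mine):

```python
import numpy as np

def most_likely(Y_U):
    """Pick, for each row, the label with the highest soft probability."""
    return Y_U.argmax(axis=1)

def class_mass_normalization(Y_U, proportions):
    """Scale column c by P_c / (mass of column c), then take the argmax."""
    masses = Y_U.sum(axis=0)
    scaled = Y_U * (np.asarray(proportions) / masses)
    return scaled.argmax(axis=1)
```
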
- Experiment Setup ● Artificial data ● Comparison LP vs. kNN (k=1) ● Character recognition: recognize handwritten digits ● Images 16×16 pixels, grayscale ● Recognizing 1, 2, 3 ● 256-dimensional vectors
- Results using LP on artificial data
- Results using LP on artificial data ● LP finds the structure in the data while kNN fails
- P1NN ● P1NN is a baseline for comparisons ● Simplified version of LP: 1. In each iteration, find the unlabeled instance nearest to a labeled instance and label it 2. Iterate until all instances are labeled
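
A sketch of the P1NN baseline as described here, greedy: repeatedly give the unlabeled instance closest to any already-labeled instance that neighbor's label (same "labeled first" convention as above):

```python
import numpy as np

def p1nn(X, y, l):
    """X: (n, d) features; y: labels of the first l rows. Returns labels for all n."""
    n = X.shape[0]
    labels = np.full(n, -1)
    labels[:l] = y
    dists = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
    while (labels == -1).any():
        lab = np.where(labels != -1)[0]
        unlab = np.where(labels == -1)[0]
        sub = dists[np.ix_(unlab, lab)]          # distances unlabeled -> labeled
        i, j = np.unravel_index(sub.argmin(), sub.shape)
        labels[unlab[i]] = labels[lab[j]]        # copy nearest labeled neighbor's label
    return labels
```
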
- Results using LP on the handwritten dataset ● P1NN (baseline), 1NN (kNN) ● Cne: class mass normalization, proportions from labeled data ● Lbo: label bidding with oracle class proportions ● ML: most likely labels
- Relation Extraction? ● From natural-language texts, detect semantic relations among entities ● Example: "B. Gates married Melinda French on January 1, 1994" → spouse(B. Gates, Melinda French)
- Why LP to do RE? ● Problems: – Supervised: needs much annotated data – Unsupervised: retrieves clusters of relations with no label
- RE – Problem Definition ● Find an appropriate label for an occurrence of two entities in a context ● Example: "… B. Gates married Melinda French on January 1, 1994": entity 1 (e1) = B. Gates, entity 2 (e2) = Melinda French, with contexts Cpre (before e1), Cmid (between e1 and e2), and Cpos (after e2) ● Idea: if two occurrences of entity pairs have similar contexts, then they have the same relation type
- RE Problem Definition – Features ● Words: in the contexts ● Entity types: Person, Location, Org, … ● POS tags: of words in the contexts ● Chunking tags: mark which words in the contexts are inside chunks ● Grammatical function of words in the contexts, e.g. NP-SBJ (subject) ● Position of words: first word of e1, is there any word in Cmid, first word in Cpre/Cmid/Cpos, …; second word of e1, second word in Cpre, …
- RE Problem Definition – Labels
- Experiment ● ACE 2003 data, a corpus from newspapers ● Assume all entities have already been identified ● Comparison between: – different amounts of labeled samples: 1%, 10%, 25%, 50%, 75%, 100% – different similarity functions – LP, SVM, and bootstrapping ● LP settings: – similarity function: cosine, Jensen-Shannon – labeling approach: take the most likely class – σ: average similarity between labeled classes
- Experiment – Jensen-Shannon ● Similarity measure: measures the distance between two probability distributions ● JS is a smoothed version of the Kullback-Leibler divergence D_KL ● Kullback-Leibler divergence: not symmetric, and does not always have a finite value
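
A sketch of the measure for two discrete distributions p and q (given as NumPy probability vectors):

```python
import numpy as np

def kl(p, q):
    """Kullback-Leibler divergence: asymmetric, infinite where q = 0 but p > 0."""
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

def jensen_shannon(p, q):
    """JS(p, q) = 1/2 KL(p||m) + 1/2 KL(q||m), m = (p + q)/2: symmetric and finite."""
    m = (p + q) / 2.0
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```
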
- Results
- Classifying Relation Subtypes – SVM vs. LP ● SVM with a linear kernel
- Bootstrapping ● Start from a set of seeds ● Train a classifier on the seeds ● Use the classifier to label new instances; add those whose confidence is high enough back to the seed set, and repeat
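
A generic sketch of this loop; the classifier, the 0.9 confidence threshold, and the stopping rules here are illustrative assumptions, not the paper's exact setup:

```python
import numpy as np
from sklearn.base import clone

def bootstrap(base_clf, X_seed, y_seed, X_unlabeled, threshold=0.9, max_rounds=10):
    """Train on the seeds, promote high-confidence predictions, retrain, repeat."""
    X_l, y_l, X_u = X_seed, y_seed, X_unlabeled
    clf = clone(base_clf).fit(X_l, y_l)
    for _ in range(max_rounds):
        if len(X_u) == 0:
            break
        proba = clf.predict_proba(X_u)
        confident = proba.max(axis=1) >= threshold    # confidence is high enough
        if not confident.any():
            break
        X_l = np.vstack([X_l, X_u[confident]])        # grow the seed set
        y_l = np.concatenate([y_l, clf.classes_[proba[confident].argmax(axis=1)]])
        X_u = X_u[~confident]
        clf = clone(base_clf).fit(X_l, y_l)           # retrain on the updated seeds
    return clf
```
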
- Classifying Relation Types – Bootstrapping vs. LP ● Starting with 100 random seeds
- Results ● LP performs well in general when there is little annotated data, compared to SVM and kNN ● Irrelevant dimensions can be identified by using LP ● Looking at the structure of the unlabeled data helps when there is little annotated data
- Thank you