
Semi-Supervised Classification of Network Data Using Very Few Labels

  1. 1. Semi-Supervised Classification of Network Data Using Very Few Labels<br />Frank Lin and William W. Cohen<br />School of Computer Science, Carnegie Mellon University<br />ASONAM 2010<br />2010-08-11, Odense, Denmark<br />
  2. 2. Overview<br />Preview<br />MultiRankWalk<br />Random Walk with Restart<br />RWR for Classification<br />Seed Preference<br />Experiments<br />Results<br />The Question<br />
  3. 3. Preview<br />Classification labels are expensive to obtain<br />Semi-supervised learning (SSL) learns from labeled and unlabeled data for classification<br />
  4. 4. Preview<br />[Adamic & Glance 2005]<br />
  5. 5. Preview<br />When it comes to network data, what is a general, simple, and effective method that requires very few labels?<br />One that researchers could use as a strong baseline when developing more complex and domain-specific methods?<br />Our Answer:<br />MultiRankWalk (MRW)<br />&<br />Label high PageRank nodes first (authoritative seeding)<br />
  6. 6. Preview<br />MRW (red) vs. a popular method (blue)<br />Only 1 training label per class!<br />accuracy<br /># of training labels<br />
  7. 7. Preview<br />The popular method using authoritative seeding (red & green) vs. random seeding (blue)<br />label “authoritative seeds” first<br />Same blue line as before<br />
  8. 8. Overview<br />Preview<br />MultiRankWalk<br />Random Walk with Restart<br />RWR for Classification<br />Seed Preference<br />Experiments<br />Results<br />The Question<br />
  9. 9. Random Walk with Restart<br />Imagine a network, and starting at a specific node, you follow the edges randomly.<br />But (perhaps you’re afraid of wandering too far) with some probability, you “jump” back to the starting node (restart!).<br />If you record the number of times you land on each node, what would that distribution look like?<br />
  10. 10. Random Walk with Restart<br />What if we start at a different node?<br />Start node<br />
  11. 11. Random Walk with Restart<br />The walk distribution r satisfies a simple equation:<br />Start node(s)<br />Transition matrix of the network<br />Equivalent to the well-known PageRank ranking if all nodes are start nodes! (u is uniform)<br />Restart probability<br />“Keep-going” probability (damping factor)<br />
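The labels on this slide (start nodes, transition matrix, restart probability, keep-going probability) fit the usual random-walk-with-restart fixed point. A form consistent with them is sketched below. The symbols r and u appear on the slide; W for the column-stochastic transition matrix and d for the damping factor are assumed names.

```latex
r = (1 - d)\,u + d\,W r
% r : walk distribution over nodes
% u : start-node distribution (uniform over all nodes gives PageRank)
% W : transition matrix of the network (assumed symbol)
% d : "keep-going" probability, i.e. the damping factor (assumed symbol)
% 1 - d : restart probability
```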
  12. 12. Random Walk with Restart<br />Random walk with restart (RWR) can be solved simply and efficiently with an iterative procedure:<br />
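The iterative procedure can be sketched as repeated application of the update r ← (1 − d)·u + d·W·r until r stops changing. This is a minimal illustration, not the authors' code: the function name, the adjacency-dict graph format, the default d = 0.85, and the stopping tolerance are all assumptions.

```python
def random_walk_with_restart(adj, start_nodes, d=0.85, tol=1e-10, max_iter=1000):
    """Approximate the RWR distribution by power iteration.

    adj         : dict mapping each node to a list of its neighbours
                  (assumed undirected and unweighted)
    start_nodes : nodes the walker jumps back to on a restart
    d           : "keep-going" probability; 1 - d is the restart probability
    """
    nodes = list(adj)
    # u puts equal mass on every start node and zero elsewhere.
    u = {v: 0.0 for v in nodes}
    for s in start_nodes:
        u[s] = 1.0 / len(start_nodes)

    r = dict(u)  # start the iteration from the restart distribution
    for _ in range(max_iter):
        # Restart term: (1 - d) * u
        nxt = {v: (1.0 - d) * u[v] for v in nodes}
        # Walk term: each node passes d * r[v] evenly to its neighbours.
        for v in nodes:
            if not adj[v]:
                continue  # isolated node: nothing to pass on
            share = d * r[v] / len(adj[v])
            for w in adj[v]:
                nxt[w] += share
        delta = sum(abs(nxt[v] - r[v]) for v in nodes)
        r = nxt
        if delta < tol:
            break
    return r


# A three-node path a - b - c with the walk restarting at a.
path = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
r = random_walk_with_restart(path, ["a"])
```

On this path graph the scores sum to 1 and the start node `a` scores higher than the far end `c`.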
  13. 13. Overview<br />Preview<br />MultiRankWalk<br />Random Walk with Restart<br />RWR for Classification<br />Seed Preference<br />Experiments<br />Results<br />The Question<br />
  14. 14. RWR for Classification<br />Simple idea: use RWR for classification<br />RWR with start nodes being labeled points in class A<br />RWR with start nodes being labeled points in class B<br />Nodes frequented more by RWR(A) belong to class A, otherwise they belong to B<br />
  15. 15. RWR for Classification<br />We refer to this method as MultiRankWalk: it classifies data with multiple rankings using random walks<br />
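The idea on these two slides can be sketched as one random walk with restart per class, followed by assigning each node to the class whose walk visits it most. This is an illustrative sketch only: the function names, the graph format, the fixed iteration count, and d = 0.85 are assumptions, not details from the slides.

```python
def rwr(adj, seeds, d=0.85, iters=200):
    """Random walk with restart from `seeds`, run for a fixed number of steps."""
    u = {v: (1.0 / len(seeds) if v in seeds else 0.0) for v in adj}
    r = dict(u)
    for _ in range(iters):
        nxt = {v: (1.0 - d) * u[v] for v in adj}
        for v, nbrs in adj.items():
            for w in nbrs:
                nxt[w] += d * r[v] / len(nbrs)
        r = nxt
    return r


def multi_rank_walk(adj, seeds_by_class, d=0.85):
    """Label every node with the class whose RWR gives it the highest score."""
    scores = {c: rwr(adj, set(s), d) for c, s in seeds_by_class.items()}
    return {v: max(scores, key=lambda c: scores[c][v]) for v in adj}


# Two triangles joined by one edge (a3 - b1), one seed label per class.
graph = {
    "a1": ["a2", "a3"], "a2": ["a1", "a3"], "a3": ["a1", "a2", "b1"],
    "b1": ["a3", "b2", "b3"], "b2": ["b1", "b3"], "b3": ["b1", "b2"],
}
labels = multi_rank_walk(graph, {"A": ["a1"], "B": ["b3"]})
```

With a single seed on each side, every node in the left triangle is assigned to A and every node in the right triangle to B.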
  16. 16. Overview<br />Preview<br />MultiRankWalk<br />Random Walk with Restart<br />RWR for Classification<br />Seed Preference<br />Experiments<br />Results<br />The Question<br />
  17. 17. Seed Preference<br />Obtaining labels for data points is expensive<br />We want to minimize cost for obtaining labels<br />Observations:<br />Some labels inherently more useful than others<br />Some labels easier to obtain than others<br />Question: “Authoritative” or “popular” nodes in a network are typically easier to obtain labels for. But are these labels also more useful than others?<br />
  18. 18. Seed Preference<br />Consider the task of giving a human expert (or posting jobs on Amazon Mechanical Turk) a list of data points to label<br />The list (seeds) can be generated uniformly at random, or we can have a seed preference, according to simple properties of the unlabeled data<br />We consider 3 preferences:<br />Random<br />Link Count: nodes with the highest counts make the list<br />PageRank: nodes with the highest scores make the list<br />
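The three preferences amount to three ways of ordering the unlabeled nodes before handing them to a labeler. A minimal sketch follows. The function names and graph format are hypothetical, link count is taken to be node degree in an undirected adjacency dict, and PageRank uses an assumed damping factor of 0.85.

```python
import random


def pagerank(adj, d=0.85, iters=100):
    """Plain PageRank by power iteration (uniform restart distribution)."""
    n = len(adj)
    r = {v: 1.0 / n for v in adj}
    for _ in range(iters):
        nxt = {v: (1.0 - d) / n for v in adj}
        for v, nbrs in adj.items():
            for w in nbrs:
                nxt[w] += d * r[v] / len(nbrs)
        r = nxt
    return r


def seed_order(adj, preference="pagerank", rng=None):
    """Return all nodes in the order they would be offered for labeling."""
    nodes = list(adj)
    if preference == "random":
        (rng or random.Random(0)).shuffle(nodes)
        return nodes
    if preference == "linkcount":
        score = {v: len(adj[v]) for v in adj}  # degree as the link count
    elif preference == "pagerank":
        score = pagerank(adj)
    else:
        raise ValueError(f"unknown preference: {preference}")
    return sorted(nodes, key=lambda v: score[v], reverse=True)


# Star graph: the hub should come first under both authoritative preferences.
star = {"hub": ["x", "y", "z"], "x": ["hub"], "y": ["hub"], "z": ["hub"]}
by_links = seed_order(star, "linkcount")
by_pagerank = seed_order(star, "pagerank")
```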
  19. 19. Overview<br />Preview<br />MultiRankWalk<br />Random Walk with Restart<br />RWR for Classification<br />Seed Preference<br />Experiments<br />Results<br />The Question<br />
  20. 20. Experiments<br />Test effectiveness of MRW and compare seed preferences on five real network datasets:<br />Political Blogs (Liberal vs. Conservative)<br />Citation Networks (7 and 6 academic fields, respectively)<br />
  21. 21. Experiments<br />We compare MRW against a currently very popular network SSL method – wvRN, the “weighted-vote relational neighbor classifier”<br />You may know wvRN as the harmonic functions method, adsorption, random walk with sink nodes, …<br />Recommended as a strong network SSL baseline in (Macskassy & Provost 2007)<br />
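For contrast with MRW, the harmonic-functions view of wvRN can be sketched as follows: seed scores are held fixed and every other node repeatedly takes the average of its neighbours' scores. This is a simplified illustration under assumptions (unweighted edges, two classes, a fixed iteration count, hypothetical function name) and not the exact procedure used in the experiments.

```python
def wvrn_scores(adj, seed_labels, positive, iters=100):
    """Score each node for the `positive` class by neighbour averaging.

    adj         : dict mapping each node to a list of neighbours
    seed_labels : dict mapping labeled nodes to their class
    positive    : the class whose score is being estimated
    """
    # Unlabeled nodes start undecided; seeds are fixed at 1.0 or 0.0.
    score = {v: 0.5 for v in adj}
    for v, label in seed_labels.items():
        score[v] = 1.0 if label == positive else 0.0

    for _ in range(iters):
        nxt = {}
        for v, nbrs in adj.items():
            if v in seed_labels or not nbrs:
                nxt[v] = score[v]  # seeds (and isolated nodes) never change
            else:
                nxt[v] = sum(score[w] for w in nbrs) / len(nbrs)
        score = nxt
    return score


# Path a - b - c - d with one seed at each end.
chain = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
s = wvrn_scores(chain, {"a": "conservative", "d": "liberal"}, "conservative")
```

On this chain the scores fall off linearly between the two seeds (1, 2/3, 1/3, 0), which is the harmonic solution.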
  22. 22. Experiments<br />To simulate a human expert labeling data, we use the “ranked-at-least-n-per-class” method<br />Political blog example with n=2, labeling down the ranked list:<br />blogsforbush.com: conservative<br />dailykos.com: liberal<br />moorewatch.com: conservative<br />right-thinking.com: conservative<br />talkingpointsmemo.com: liberal<br />We have at least 2 labels per class. Stop.<br />Remaining blogs in the ranked list (not labeled):<br />instapundit.com<br />michellemalkin.com<br />atrios.blogspot.com<br />littlegreenfootballs.com<br />washingtonmonthly.com<br />powerlineblog.com<br />drudgereport.com<br />
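The “ranked-at-least-n-per-class” procedure can be sketched as walking down the ranked list, asking for a label at each step, and stopping once every class has at least n labels. The function name and the `oracle` callable are hypothetical. Pairing the slide's five labels with the first five blogs in order is a reading of the flattened slide, not something the transcript states outright.

```python
from collections import Counter


def ranked_at_least_n_per_class(ranked_nodes, oracle, classes, n):
    """Label nodes in rank order until each class has at least n labels.

    oracle : callable returning the true label of a node
             (stands in for the human expert)
    """
    counts = Counter()
    labeled = {}
    for node in ranked_nodes:
        label = oracle(node)
        labeled[node] = label
        counts[label] += 1
        if all(counts[c] >= n for c in classes):
            break  # every class is covered; stop asking for labels
    return labeled


ranked = [
    "blogsforbush.com", "dailykos.com", "moorewatch.com",
    "right-thinking.com", "talkingpointsmemo.com", "instapundit.com",
    "michellemalkin.com", "atrios.blogspot.com", "littlegreenfootballs.com",
    "washingtonmonthly.com", "powerlineblog.com", "drudgereport.com",
]
# Labels for the five blogs the procedure reaches in the slide's example.
known = {
    "blogsforbush.com": "conservative",
    "dailykos.com": "liberal",
    "moorewatch.com": "conservative",
    "right-thinking.com": "conservative",
    "talkingpointsmemo.com": "liberal",
}
seeds = ranked_at_least_n_per_class(
    ranked, known.__getitem__, ["conservative", "liberal"], n=2
)
```

With n=2 the loop stops after the fifth blog, so `seeds` holds three conservative and two liberal labels and the rest of the list is never sent to the labeler.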
  23. 23. Overview<br />Preview<br />MultiRankWalk<br />Random Walk with Restart<br />RWR for Classification<br />Seed Preference<br />Experiments<br />Results<br />The Question<br />
  24. 24. Results<br />Averaged over 20 runs<br />MRW vs. wvRN with random seed preference<br />MRW does extremely well with just one randomly selected label per class!<br />MRW drastically better with a small number of seed labels; performance not significantly different with larger numbers of seeds<br />
  25. 25. Results<br />wvRN with different seed preferences<br />LinkCount or PageRank much better than Random with smaller number of seed labels<br />PageRank slightly better than LinkCount, but in general not significantly so<br />
  26. 26. Results<br />Does MRW benefit from seed preference?<br />Yes, on certain datasets with a small number of seed labels; note the already very high F1 on most datasets<br />There is a rare instance where authoritative seeds hurt performance, but the difference is not statistically significant<br />
  27. 27. Results<br />How much better is MRW using authoritative seed preference?<br />y-axis: MRW F1 score minus wvRN F1 score<br />x-axis: number of seed labels per class<br />The gap between MRW and wvRN narrows with authoritative seeds, but it is still prominent on some datasets with a small number of seed labels<br />
  28. 28. Results<br />Summary<br />MRW much better than wvRN with small number of seed labels<br />MRW more robust to varying quality of seed labels than wvRN<br />Authoritative seed preference boosts algorithm effectiveness with small number of seed labels <br />We recommend MRW and authoritative seed preference as a strong baseline for semi-supervised classification on network data<br />
  29. 29. Overview<br />Preview<br />MultiRankWalk<br />Random Walk with Restart<br />RWR for Classification<br />Seed Preference<br />Experiments<br />Results<br />The Question<br />
  30. 30. The Question<br />What really makes MRW and wvRN different?<br />Network-based SSL often boils down to label propagation.<br />MRW and wvRN represent two general propagation methods – note that they are called by many names:<br />Great…but we still don’t know why they behave differently on these network datasets!<br />
  31. 31. The Question<br />It’s difficult to answer exactly why MRW does better with a smaller number of seeds.<br />But we can gather probable factors from their propagation models:<br />
  32. 32. The Question<br />An example from a political blog dataset – MRW vs. wvRN scores for how much a blog is politically conservative (seed labels underlined):<br />Scores, first column:<br />1.000 neoconservatives.blogspot.com<br />1.000 strangedoctrines.typepad.com<br />1.000 jmbzine.com<br />0.593 presidentboxer.blogspot.com<br />0.585 rooksrant.com<br />0.568 purplestates.blogspot.com<br />0.553 ikilledcheguevara.blogspot.com<br />0.540 restoreamerica.blogspot.com<br />0.539 billrice.org<br />0.529 kalblog.com<br />0.517 right-thinking.com<br />0.517 tom-hanna.org<br />0.514 crankylittleblog.blogspot.com<br />0.510 hasidicgentile.org<br />0.509 stealthebandwagon.blogspot.com<br />0.509 carpetblogger.com<br />0.497 politicalvicesquad.blogspot.com<br />0.496 nerepublican.blogspot.com<br />0.494 centinel.blogspot.com<br />0.494 scrawlville.com<br />0.493 allspinzone.blogspot.com<br />0.492 littlegreenfootballs.com<br />0.492 wehavesomeplanes.blogspot.com<br />0.491 rittenhouse.blogspot.com<br />0.490 secureliberty.org<br />0.488 decision08.blogspot.com<br />0.488 larsonreport.com<br />Scores, second column:<br />0.020 firstdownpolitics.com<br />0.019 neoconservatives.blogspot.com<br />0.017 jmbzine.com<br />0.017 strangedoctrines.typepad.com<br />0.013 millers_time.typepad.com<br />0.011 decision08.blogspot.com<br />0.010 gopandcollege.blogspot.com<br />0.010 charlineandjamie.com<br />0.008 marksteyn.com<br />0.007 blackmanforbush.blogspot.com<br />0.007 reggiescorner.blogspot.com<br />0.007 fearfulsymmetry.blogspot.com<br />0.006 quibbles-n-bits.com<br />0.006 undercaffeinated.com<br />0.005 samizdata.net<br />0.005 pennywit.com<br />0.005 pajamahadin.com<br />0.005 mixtersmix.blogspot.com<br />0.005 stillfighting.blogspot.com<br />0.005 shakespearessister.blogspot.com<br />0.005 jadbury.com<br />0.005 thefulcrum.blogspot.com<br />0.005 watchandwait.blogspot.com<br />0.005 gindy.blogspot.com<br />0.005 cecile.squarespace.com<br />0.005 usliberals.about.com<br />0.005 twentyfirstcenturyrepublican.blogspot.com<br />1. Centrality-sensitive: seeds have different scores and not necessarily the highest<br />2. Exponential drop-off: much less sure about nodes further away from seeds<br />3. Classes propagate independently: charlineandjamie.com is both very likely a conservative and a liberal blog (good or bad?)<br />We don’t completely understand it yet.<br />
  33. 33. Questions?<br />
  34. 34. Related Work<br />MRW is very much related to<br />“Local and global consistency” (Zhou et al. 2004)<br />“Web content categorization using link information” (Gyongyi et al. 2006)<br />“Graph-based semi-supervised learning as a generative model” (He et al. 2007)<br />Seed preference is related to the field of active learning<br />Active learning chooses which data point to label next based on previous labels; the labeling is interactive<br />Seed preference is a batch labeling method<br />Random walk without restart, heuristic stopping<br />RWR ranking as features to SVM<br />Similar formulation, different view<br />Authoritative seed preference is a good baseline for active learning on network data!<br />
