Machine learning for the Web:

Machine learning for the Web: Applications and challenges
Soumen Chakrabarti
Center for Intelligent Internet Research
Computer Science and Engineering, IIT Bombay
www.cse.iitb.ernet.in/~soumen

Traditional supervised learning
- Independent variables x: mostly continuous, maybe categorical
- Predicted variable y: discrete (classification) or continuous (regression)
- Statistical models, inference rules, or separators
[Figure: training instances go to a learner; a test instance goes to the learner, which outputs the prediction y]

Traditional unsupervised learning
- No training / testing phases
- Input is a collection of records with independent attributes alone
- Measure of similarity
- Partition or cover instances using clusters with large “self-similarity” and small “cross-similarity”
- Hierarchical partitions
[Figure labels: large self-similarity, small cross-similarity]

Learning hypertext models
- Entities are pages, sites, paragraphs, links, people, bookmarks, clickstreams…
- Transformed into simple models and relations
  - Vector space/bag-of-words: occurs(term, page, cnt)
  - Hyperlink graph: cites(page, page)
  - Topic directories: is-a(topic, topic), example(topic, page)
  - Discrete time series
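The relations named on this slide can be sketched as plain Python records. This is only an illustration: the `Occurs` and `Cites` tuple types, the `bag_of_words` helper, the tokenizer, and the two toy pages are assumptions made for the example, not part of the talk.

```python
import re
from collections import Counter, namedtuple

# Hypothetical record types mirroring occurs(term, page, cnt) and cites(page, page).
Occurs = namedtuple("Occurs", ["term", "page", "cnt"])
Cites = namedtuple("Cites", ["src", "dst"])

def bag_of_words(page_id, text):
    """Turn raw page text into occurs(term, page, cnt) rows."""
    counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
    return [Occurs(term, page_id, cnt) for term, cnt in sorted(counts.items())]

# Toy data, invented for illustration only.
pages = {"p1": "Web search and web mining", "p2": "Mining hyperlink graphs"}
occurs = [row for pid, text in pages.items() for row in bag_of_words(pid, text)]
cites = [Cites("p1", "p2")]  # p1 links to p2
```

Word order and markup are discarded in this representation, which is the loss of detail the following slides discuss.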

Challenges
- Large feature space in raw data
  - Structured data sets: 10s to 100s
  - Text (Web): 50 to 100 thousand
- Most features not completely useless
  - Feature elimination / selection not perfect
  - Beyond linear transformations?
- Models used today are simplistic
  - Good accuracy on simple labeling tasks
  - Lose a lot of detail present in hypertext to fit known learning techniques

Challenges
- Complex, interrelated objects
  - Not a structured tuple-like entity
- Explicit and implicit connections
  - Document markup sub-structure
  - Site boundaries and hyperlinks
  - Placement in popular directories like Yahoo!
- Traditional distance measures are noisy
  - How to combine diverse features? (Or, a link is worth a ? words)
  - Unreliable clustering results

This session
- Semi-supervised clustering (Rich Caruana)
  - Enhanced clustering via user feedback
- Kernel methods (Nello Cristianini)
  - Modular learning systems for text and hypertext
- Reference matching (Andrew McCallum)
  - Recovering and cleaning implicit citation graphs from unstructured data

This talk: Two examples
- Learning topics of hypertext documents
  - Semi-supervised learning scenario
  - Unified model of text and hyperlinks
  - Enhanced accuracy of topic labeling
- Segmenting hierarchical tagged pages
  - Topic distillation (hubs and authorities)
  - Minimum description length segmentation
  - Better focused topic distillation
  - Extract relevant fragments from pages

Classifying interconnected entities
- Early examples:
  - Some diseases have complex lineage dependency
  - Robust edge detection in images
- How are topics interconnected in hypertext?
- Maximum likelihood graph labeling with many classes
[Figure: finding edge pixels in a differentiated image; a graph of nodes with unknown labels (?) around a node with class probabilities .3 red, .7 blue]

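Naïve Bayes classifiers
- Decide topic; topic c is picked with prior probability $\pi(c)$; $\sum_c \pi(c) = 1$
- Each c has parameters $\theta(c, t)$ for terms t
- Coin with face probabilities $\theta(c, t)$; $\sum_t \theta(c, t) = 1$
- Fix document length n(d) and toss coin
- Naïve yet effective; can use other algos
- Given c, probability of document is $\Pr(d \mid c) = \binom{n(d)}{\{n(d,t)\}} \prod_{t \in d} \theta(c,t)^{n(d,t)}$, where n(d, t) is the number of times term t occurs in d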
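A minimal sketch of the multinomial model described on this slide, working in log space. The function names `train` and `classify`, the add-one (Laplace) smoothing, and the toy documents in the usage note are assumptions for illustration; the slide does not specify an estimator.

```python
import math
from collections import Counter

def train(docs, smoothing=1.0):
    """docs: list of (class_label, list_of_terms).

    Returns (log_prior, log_theta), where log_prior[c] estimates log pi(c)
    and log_theta[c][t] estimates log theta(c, t). Add-one smoothing is an
    assumption of this sketch.
    """
    vocab = sorted({t for _, terms in docs for t in terms})
    class_docs = Counter(c for c, _ in docs)
    term_counts = {c: Counter() for c in class_docs}
    for c, terms in docs:
        term_counts[c].update(terms)
    log_prior = {c: math.log(n / len(docs)) for c, n in class_docs.items()}
    log_theta = {}
    for c, counts in term_counts.items():
        total = sum(counts.values()) + smoothing * len(vocab)
        log_theta[c] = {t: math.log((counts[t] + smoothing) / total) for t in vocab}
    return log_prior, log_theta

def classify(terms, log_prior, log_theta):
    """Pick the class maximizing log pi(c) + sum over terms of log theta(c, t).

    The multinomial coefficient is the same for every class, so it is omitted.
    Terms never seen in training are skipped.
    """
    scores = {
        c: log_prior[c] + sum(log_theta[c][t] for t in terms if t in log_theta[c])
        for c in log_prior
    }
    return max(scores, key=scores.get)
```

For example, after `log_prior, log_theta = train([("sports", ["ball", "goal", "ball"]), ("tech", ["code", "web", "code"])])`, the call `classify(["ball", "goal"], log_prior, log_theta)` picks `"sports"`.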

Enhanced models for hypertext
- c = class, d = text, N = neighbors
- Text-only model: Pr(d | c)
- Using neighbors’ text to judge my topic: Pr(d, d(N) | c)
- Better recursive model: Pr(d, c(N) | c)
- Relaxation labeling over Markov random fields
- Or, EM formulation
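The recursive idea, where a page's class depends on its own text and on its neighbors' current class estimates, can be sketched as an iterative update in the spirit of relaxation labeling. This is a simplified illustration, not the talk's exact formulation: the `compat` table (how likely class c is to link to class c2), the independence assumption across neighbors, and all names are hypothetical.

```python
def normalize(scores):
    """Scale non-negative scores so they sum to 1."""
    total = sum(scores.values())
    return {k: v / total for k, v in scores.items()}

def relaxation_labeling(text_lik, neighbors, compat, iterations=10):
    """Iteratively re-estimate class beliefs from text and neighbor beliefs.

    text_lik[v][c]  : Pr(d_v | c), e.g. from a text-only classifier
    neighbors[v]    : list of nodes linked to v
    compat[c][c2]   : assumed compatibility of class c with a neighbor of class c2
    """
    classes = list(compat)
    # Start from the text-only estimate.
    belief = {v: normalize(lik) for v, lik in text_lik.items()}
    for _ in range(iterations):
        new_belief = {}
        for v in belief:
            scores = {}
            for c in classes:
                s = text_lik[v][c]
                # Treat neighbors as independent given v's class (an assumption).
                for n in neighbors.get(v, []):
                    s *= sum(belief[n][c2] * compat[c][c2] for c2 in classes)
                scores[c] = s
            new_belief[v] = normalize(scores)
        belief = new_belief
    return belief
```

A page whose own text is ambiguous but whose neighbors are confidently labeled drifts toward the neighbors' class, which is the effect the recursive model is meant to capture.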

Hyperlink modeling boosts accuracy
- 9600 patents from 12 classes marked by USPTO
- Patents have text and prior art links
- Expand test patent to include neighborhood
- ‘Forget’ and re-estimate fraction of neighbors’ classes
- (Even better for Yahoo)

Hyperlink Induced Topic Search
- Query → keyword search engine → response
- Radius-1 expanded graph
- ‘Hubs’ and ‘authorities’: $a = E^T h$, $h = E a$
[Figure: expanded graph with nodes labeled h (hub) and a (authority)]
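The two update rules, a = E^T h and h = E a, can be sketched as a short power iteration over an edge list. The function name `hits`, the L2 normalization, and the fixed iteration count are choices made for this illustration rather than details given on the slide.

```python
import math

def hits(edges, iterations=50):
    """Run hub/authority iterations on a directed graph.

    edges: list of (src, dst) pairs, meaning page src links to page dst.
    Returns (hub, auth) dictionaries with L2-normalized scores.
    """
    nodes = sorted({u for edge in edges for u in edge})
    hub = {u: 1.0 for u in nodes}
    auth = {u: 1.0 for u in nodes}
    for _ in range(iterations):
        # a = E^T h: a page's authority is the sum of hub scores pointing to it.
        auth = {u: 0.0 for u in nodes}
        for src, dst in edges:
            auth[dst] += hub[src]
        # h = E a: a page's hub score is the sum of authority scores it points to.
        new_hub = {u: 0.0 for u in nodes}
        for src, dst in edges:
            new_hub[src] += auth[dst]
        hub = new_hub
        # Normalize both vectors so scores stay bounded.
        for vec in (hub, auth):
            norm = math.sqrt(sum(x * x for x in vec.values())) or 1.0
            for u in vec:
                vec[u] /= norm
    return hub, auth
```

On the toy graph `[("h1", "a1"), ("h1", "a2"), ("h2", "a1")]`, page `a1` ends up with a higher authority score than `a2`, and `h1` with a higher hub score than `h2`.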

“Topic drift” and various fixes
- Some hubs have ‘mixed’ content
- Authority ‘leaks’ through mixed hubs from good to bad pages
- Clever: match query with anchor text to favor some edges
- B&H: eliminate outlier documents
[Figure labels: vector-space document model with centroid (×) and cut-off radius; query term, activation window, ‘thick’ links]

Document object model (DOM)
- Hierarchical graph model for semi-structured data
- Can extract reasonable DOM from HTML
- A fine-grained view of the Web
- Valuable because page boundaries are less meaningful now
Example HTML:
  <html><head> <title>Portals</title> </head><body><ul> <li><a href="…">Yahoo</a></li> <li><a href="…">Lycos</a></li> </ul></body></html>
[Figure: the corresponding DOM tree: html has children head (containing title) and body (containing ul); ul has two li children, each containing an a]
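As a rough illustration of extracting a DOM tree from HTML, the sketch below uses Python's standard `html.parser` module on the slide's example page. The `DomBuilder` class, the dictionary node layout, and the `outline` helper are assumptions for this example; real pages with badly nested tags need a more forgiving parser.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag; they are added as leaves.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}

class DomBuilder(HTMLParser):
    """Build a nested dict tree of element nodes (text is ignored)."""

    def __init__(self):
        super().__init__()
        self.root = {"tag": "#document", "children": []}
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = {"tag": tag, "children": []}
        self.stack[-1]["children"].append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open element, if there is one.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i]["tag"] == tag:
                del self.stack[i:]
                break

def outline(node, depth=0):
    """Return the tree as indented lines, one tag per line."""
    lines = ["  " * depth + node["tag"]]
    for child in node["children"]:
        lines.extend(outline(child, depth + 1))
    return lines

# The slide's example page; the href values are elided in the source.
PAGE = ('<html><head> <title>Portals</title> </head><body><ul> '
        '<li><a href="…">Yahoo</a></li> <li><a href="…">Lycos</a></li> '
        '</ul></body></html>')

builder = DomBuilder()
builder.feed(PAGE)
tree_lines = outline(builder.root)
```

Each `li` subtree here is a candidate "micro-hub": a fragment of the page that can be scored separately from the rest.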

A model for hub generation
- Global hub score distribution $\Theta_0$ w.r.t. given query
- Authors use DOM nodes to specialize $\Theta_0$ into local $\Theta_i$
- At a certain ‘cut’ in the DOM tree, local distribution directly generates hub scores
[Figure labels: global distribution, progressive ‘distortion’, model frontier, other pages]

Optimizing a cost measure
[Figure: DOM node v with the hub scores H_v in its subtree; reference distribution $\Theta_0$]
- Data encoding cost is roughly
- Distribution distortion cost is (for Poisson distribution)
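The formulas for the two costs did not survive on this slide. As a hedged sketch of what a minimum description length trade-off of this kind usually looks like, and not a reconstruction of the talk's exact expressions, the data cost can be taken as the negative log-likelihood of the hub scores under the local distribution, and the distortion cost as a Kullback-Leibler divergence from the reference distribution. The Poisson means $\lambda_0$ and $\lambda_v$, the symbol $\Theta_v$ for the local distribution at node v, and the direction of the divergence are assumptions of this sketch.

```latex
% Data encoding cost: bits needed to encode the hub scores H_v under the local model
\text{DataCost}(v) \approx -\sum_{h \in H_v} \log \Pr(h \mid \Theta_v)

% Distortion cost, assuming Poisson distributions with means \lambda_v and \lambda_0
\mathrm{KL}\big(\mathrm{Poisson}(\lambda_v) \,\|\, \mathrm{Poisson}(\lambda_0)\big)
  = \lambda_v \ln\frac{\lambda_v}{\lambda_0} - \lambda_v + \lambda_0
```

Under this reading, specializing the model further down the DOM tree lowers the data cost but raises the distortion cost, and the chosen cut balances the two.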

Modified topic distillation algorithm
- Will this (non-linear) system converge?
- Will segmentation help in reducing drift?
Algorithm:
- Initialize DOM graph
- Let only root set authority scores be 1
- Repeat until reasonable convergence:
  - Authority-to-hub score propagation
  - MDL-based hub score smoothing
  - Hub-to-authority score propagation
  - Normalization of authority scores
- Segment and rank micro-hubs
- Present annotated results

Convergence
- 28 queries used in Clever and by B&H
- 366k macro-pages, 10M micro-links
- Rank converges within 15 iterations

Effect of micro-hub segmentation
- ‘Expanded’ implies authority diffusion arrested
- As nodes outside rootset start participating in the distillation…
  - #Expanded increases
  - #Pruned decreases
- Prevents authority leaks via mixed hubs

Rank correlation with B&H
- Positively correlated
- Some negative deviations
- Pseudo-authorities downgraded by our algorithm
  - These were earlier favored by mixed hubs
- (Axes not to same scale)

Conclusion
- Hypertext and the Web pose new modeling and algorithmic challenges
- Locality exists in many guises
- Diverse sources of information: text, links, markup, usage
- Unifying models needed
- Anecdotes suggest that synergy can be exploited
