SLIDES: Hypertext Data Mining: A Tutorial Survey
  • Speaker note: RDBMS vs. custom storage managers and the space-time tradeoff.
  • Transcript

• 1. Hypertext data mining: A tutorial survey
  Soumen Chakrabarti, Indian Institute of Technology Bombay
  http://www.cse.iitb.ac.in/~soumen ([email_address])
• 2. Hypertext databases
  • Academia
    • Digital library, web publication
  • Consumer
    • Newsgroups, communities, product reviews
  • Industry and organizations
    • Health care, customer service
    • Corporate email
  • An inherently collaborative medium
  • Bigger than the sum of its parts
• 3. The Web
  • 2 billion HTML pages, several terabytes
  • Highly dynamic
    • 1 million new pages per day
    • Over 600 GB of pages change per month
    • The average page changes in a few weeks
  • Largest crawlers
    • Refresh less than 18% of pages within a few weeks
    • Cover less than 50% of the Web, ever
  • The average page has 7–10 links
    • Links form content-based communities
• 4. The role of data mining
  • Search and measures of similarity
  • Unsupervised learning
    • Automatic topic taxonomy generation
  • (Semi-)supervised learning
    • Taxonomy maintenance, content filtering
  • Collaborative recommendation
    • Static page contents
    • Dynamic page-visit behavior
  • Hyperlink graph analyses
    • Notions of centrality and prestige
• 5. Differences from structured data
  • Document vs. rows and columns
    • Extended complex objects
    • Links and relations to other objects
  • Document as XML graph
    • Combine models and analyses for attributes, elements, and CDATA
    • Models differ from the structured scenario
  • Very high dimensionality
    • Tens of thousands of dimensions, as against dozens
    • Sparse: most dimensions absent or irrelevant
  • Complex taxonomies and ontologies
• 6. The sublime and the ridiculous
  • What is the exact circumference of a circle of radius one inch?
  • Is the distance between Tokyo and Rome more than 6000 miles?
  • What is the distance between Tokyo and Rome?
  • java
  • java +coffee -applet
  • "uninterrupt* power suppl*" ups -parcel
• 7. Search products and services
  • Verity
  • Fulcrum
  • PLS
  • Oracle text extender
  • DB2 text extender
  • Infoseek Intranet
  • SMART (academic)
  • Glimpse (academic)
  • Inktomi (HotBot)
  • Alta Vista
  • Raging Search
  • Google
  • Dmoz.org
  • Yahoo!
  • Infoseek Internet
  • Lycos
  • Excite
• 8. [Diagram: the web-mining landscape. FTP, Gopher, and HTML lead to crawling, indexing, and search over local data, web servers, and web browsers (monitor, mine, modify); layered on the social network of hyperlinks are relevance ranking, latent semantic indexing, topic directories, clustering (Scatter-Gather), semi-supervised learning, automatic classification, web communities, topic distillation, focused crawling, user profiling, collaborative filtering, and query languages such as WebSQL, WebL, and XML]
• 9. Roadmap
  • Basic indexing and search
  • Measures of similarity
  • Unsupervised learning or clustering
  • Supervised learning or classification
  • Semi-supervised learning
  • Analyzing hyperlink structure
  • Systems issues
  • Resources and references
    • 10. Basic indexing and search
• 11. Keyword indexing
  • Boolean search
    • care AND NOT old
  • Stemming
    • gain*
  • Phrases and proximity
    • "new care"
    • loss NEAR/5 care
    • <SENTENCE>
  [Example: D1 = "My care is loss of care with old care done", D2 = "Your care is gain of care with new care won" (word positions counted from 0); postings: care → D1: 1, 5, 8 and D2: 1, 5, 8; new → D2: 7; old → D1: 7; loss → D1: 3]
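The postings on this slide can be built as a small positional inverted index. A minimal Python sketch (the function names and the phrase-check helper are illustrative, not from the slides):

```python
from collections import defaultdict

def build_index(docs):
    """Build a positional inverted index: term -> {doc_id: [positions]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    return index

def phrase_query(index, w1, w2):
    """Docs where w2 occurs immediately after w1 (proximity test a + 1 = b)."""
    hits = set()
    for doc_id in set(index.get(w1, {})) & set(index.get(w2, {})):
        p2 = set(index[w2][doc_id])
        if any(a + 1 in p2 for a in index[w1][doc_id]):
            hits.add(doc_id)
    return hits

docs = {
    "D1": "My care is loss of care with old care done",
    "D2": "Your care is gain of care with new care won",
}
index = build_index(docs)
# "care" occurs at positions 1, 5, 8 in both documents;
# the phrase query "new care" matches only D2
```

A real engine would store the postings compressed on disk (see slide 13), but the lookup logic is the same.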
• 12. Tables and queries (POSTING)

  select distinct did
  from POSTING
  where tid = 'care'
  except
  select distinct did
  from POSTING
  where tid like 'gain%'

  with TPOS1(did, pos) as
    (select did, pos from POSTING where tid = 'new'),
  TPOS2(did, pos) as
    (select did, pos from POSTING where tid = 'care')
  select distinct did
  from TPOS1, TPOS2
  where TPOS1.did = TPOS2.did
    and proximity(TPOS1.pos, TPOS2.pos)

  proximity(a, b) ::= a + 1 = b        (phrase)
                   |  abs(a - b) < 5   (proximity window)
• 13. Issues
  • Space overhead
    • 5–15% without position information
    • 30–50% to support proximity search
    • Content-based clustering and delta-encoding of document and term IDs can reduce space
  • Updates
    • Complex for a compressed index
    • Global statistics decide ranking
    • Typically batch updates with ping-pong
• 14. Relevance ranking
  • Recall = coverage
    • What fraction of relevant documents were reported
  • Precision = accuracy
    • What fraction of reported documents were relevant
  • Trade-off between the two
  • 'Query' generalizes to 'topic'
  [Diagram: compare the "true response" to a query against the prefix of length k of the search engine's output sequence]
• 15. Vector space model and TFIDF
  • Some words are more important than others
  • W.r.t. a document collection D
    • d+ documents have the term, d− do not
    • "Inverse document frequency" (IDF)
  • "Term frequency" (TF)
    • Many variants
  • Probabilistic models
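One concrete TF·IDF variant, among the "many variants" the slide alludes to, is raw term frequency times log(N / document frequency); the specific formula choice here is an assumption. A minimal sketch:

```python
import math

def tfidf_vectors(docs):
    """docs: list of token lists. Returns one {term: weight} vector per
    document, using raw term frequency and IDF(t) = log(N / df_t)."""
    n = len(docs)
    df = {}                       # document frequency of each term
    for doc in docs:
        for t in set(doc):
            df[t] = df.get(t, 0) + 1
    vecs = []
    for doc in docs:
        tf = {}                   # term frequency within this document
        for t in doc:
            tf[t] = tf.get(t, 0) + 1
        vecs.append({t: c * math.log(n / df[t]) for t, c in tf.items()})
    return vecs

vecs = tfidf_vectors([["care", "care", "old"], ["care", "new"]])
# "care" appears in every document, so its IDF (and weight) is zero;
# "old" appears in one of two documents, so its weight is log 2
```

Note how a term appearing in every document gets weight zero: it carries no discriminating power, exactly the intuition behind IDF.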
• 16. 'Iceberg' queries
  • Given a query
    • For all pages in the database, compute the similarity between query and page
    • Report the 10 most similar pages
  • Ideally, computation and IO effort should be related to output size
    • An inverted index with AND may violate this
  • Similar issues arise in clustering and classification
    • 17. Similarity and clustering
• 18. Clustering
  • Given an unlabeled collection of documents, induce a taxonomy based on similarity (such as Yahoo!)
  • Need a document similarity measure
    • Represent documents by TFIDF vectors
    • Distance between document vectors
    • Cosine of the angle between document vectors
  • Issues
    • Large number of noisy dimensions
    • Notion of noise is application dependent
• 19. Document model
  • Vocabulary V, terms w_i; document d represented by the term-count vector f(d) = (f(d, w_1), f(d, w_2), …)
  • f(d, w_i) is the number of times w_i occurs in document d
  • Most f's are zeroes for a single document
  • Monotone component-wise damping function g, such as log or square root
• 20. Similarity
  • Normalized document profile [formula not recovered from slide]
  • Profile for a document group [formula not recovered from slide]
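With normalized profiles, similarity is typically the cosine measure mentioned on slide 18. A sketch over sparse term-weight dictionaries (the dict representation is an assumption of this sketch):

```python
import math

def cosine(u, v):
    """Cosine of the angle between two sparse term-weight vectors,
    represented as {term: weight} dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0
```

Identical vectors score 1.0; vectors with no shared terms score 0.0, regardless of document length.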
• 21. Top-down clustering
  • k-Means: repeat…
    • Choose k arbitrary 'centroids'
    • Assign each document to its nearest centroid
    • Recompute centroids
  • Expectation maximization (EM):
    • Pick k arbitrary 'distributions'
    • Repeat:
      • Find the probability that document d is generated from distribution f, for all d and f
      • Estimate distribution parameters from the weighted contribution of documents
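The k-means loop above is short enough to sketch in full. This is a plain dense-vector version with Euclidean distance; a text clusterer would use normalized TFIDF vectors and cosine distance instead:

```python
import random

def kmeans(vectors, k, iters=10, seed=0):
    """Plain k-means: pick k starting centroids, then repeatedly
    assign each vector to its nearest centroid and recompute centroids."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in vectors:
            j = min(range(k),
                    key=lambda i: sum((a - b) ** 2
                                      for a, b in zip(v, centroids[i])))
            clusters[j].append(v)
        for i, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster empties out
                centroids[i] = [sum(xs) / len(members)
                                for xs in zip(*members)]
    return centroids, clusters

vectors = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
centroids, clusters = kmeans(vectors, 2)
# the two nearby pairs end up in separate clusters
```

The non-deterministic behavior noted on slide 24 shows up here too: a different seed gives different starting centroids, though this well-separated example converges to the same split.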
• 22. Bottom-up clustering
  • Initially G is a collection of singleton groups, each with one document
  • Repeat
    • Find the pair Γ, Δ in G with maximum merged similarity s(Γ ∪ Δ)
    • Merge group Γ with group Δ
  • For each Γ keep track of the best Δ
  • O(n² log n) algorithm with n² space
• 23. Updating group average profiles
  • Un-normalized group profile [formula not recovered from slide]
  • Can show [incremental update identity not recovered from slide]
• 24. "Rectangular time" algorithm
  • Quadratic time is too slow
  • Randomly sample documents
  • Run the group-average clustering algorithm to reduce to k groups or clusters
  • Iterate assign-to-nearest O(1) times
    • Move each document to its nearest cluster
    • Recompute cluster centroids
  • Total time taken is O(kn)
  • Non-deterministic behavior
• 25. Issues
  • Detecting noise dimensions
    • Bottom-up dimension composition is too slow
    • Definition of noise depends on the application
  • Running time
    • Distance computation dominates
    • Random projections
    • Sublinear time without losing small clusters
  • Integrating semi-structured information
    • Hyperlinks and tags embed similarity clues
    • A link is worth __?__ words
• 26. Random projection
  • Johnson-Lindenstrauss lemma:
    • Given a set of points in n dimensions
    • Pick a randomly oriented k-dimensional subspace, with k in a suitable range
    • Project the points onto the subspace
    • Inter-point distances are preserved w.h.p.
  • Preserve sparseness in practice by
    • Sampling original points uniformly
    • Pre-clustering and choosing cluster centers
    • Projecting other points onto the center vectors
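A common concrete realization of the lemma, sketched here under the assumption that Gaussian random directions stand in for the "randomly oriented subspace": project onto k random Gaussian vectors and rescale by 1/√k, so squared distances are preserved in expectation.

```python
import math
import random

def random_projection(points, k, seed=0):
    """Project n-dim points onto k random Gaussian directions, scaled by
    1/sqrt(k), so inter-point distances are approximately preserved."""
    rng = random.Random(seed)
    n = len(points[0])
    R = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(k)]
    scale = 1.0 / math.sqrt(k)
    return [[scale * sum(r[i] * p[i] for i in range(n)) for r in R]
            for p in points]

# Two 100-dim points at distance 1; after projection to 64 dims the
# distance is close to 1 with high probability
pts = [[1.0] + [0.0] * 99, [0.0] * 100]
proj = random_projection(pts, 64)
```

Note that the projection is data-independent: the same R works for any point set, which is what makes the technique cheap.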
• 27. Extended similarity
  • Where can I fix my scooter?
  • A great garage to repair your 2-wheeler is at …
  • auto and car co-occur often
  • Documents having related words are related
  • Useful for search and clustering
  • Two basic approaches
    • Hand-made thesaurus (WordNet)
    • Co-occurrence and associations
  [Diagram: documents in which "car" and "auto" co-occur, suggesting car ≈ auto]
• 28. Latent semantic indexing
  [Diagram: SVD of the t × d term-by-document matrix A into U, D (r singular values), and V; truncating to the k largest singular values maps terms such as "car" and "auto", and each document, to nearby k-dimensional vectors]
• 29. LSI summary
  • SVD factorization applied to the term-by-document matrix
  • Singular values with the largest magnitude retained
  • Linear transformation induced on terms and documents
  • Documents preprocessed and stored as LSI vectors
  • Query transformed at run-time and best documents fetched
• 30. Collaborative recommendation
  • People = records, movies = features
  • People and features to be clustered
    • Mutual reinforcement of similarity
  • Need advanced models
  (From "Clustering methods in collaborative filtering", by Ungar and Foster)
• 31. A model for collaboration
  • People and movies belong to unknown classes
  • P_k = probability a random person is in class k
  • P_l = probability a random movie is in class l
  • P_kl = probability of a class-k person liking a class-l movie
  • Gibbs sampling: iterate
    • Pick a person or movie at random and assign it to a class with probability proportional to P_k or P_l
    • Estimate new parameters
    • 32. Supervised learning
• 33. Supervised learning (classification)
  • Many forms
    • Content: automatically organize the web per Yahoo!
    • Type: faculty, student, staff
    • Intent: education, discussion, comparison, advertisement
  • Applications
    • Relevance feedback for re-scoring query responses
    • Filtering news, email, etc.
    • Narrowing searches and selective data acquisition
• 34. Nearest neighbor classifier
  • Build an inverted index of training documents
  • Find the k documents having the largest TFIDF similarity with the test document
  • Use (weighted) majority votes from the training documents' classes to classify the test document
• 35. Difficulties
  • Context-dependent noise (taxonomy)
    • 'Can' (v.) considered a 'stopword'
    • 'Can' (n.) may not be a stopword in /Yahoo/SocietyCulture/Environment/Recycling
  • Dimensionality
    • Decision tree classifiers: dozens of columns
    • Vector space model: 50,000 'columns'
    • Computational limits force independence assumptions, leading to poor accuracy
• 36. Techniques
  • Nearest neighbor
    • A standard keyword index also supports classification
    • How to define similarity? (TFIDF may not work)
    • Wastes space by storing individual document info
  • Rule-based, decision-tree based
    • Very slow to train (but quick to test)
    • Good accuracy (but brittle rules tend to overfit)
  • Model-based
    • Fast training and testing with a small footprint
  • Separator-based
    • Support vector machines
• 37. Document generation models
  • Boolean vector (word counts ignored)
    • Toss one coin for each term in the universe
  • Bag of words (multinomial)
    • Toss a coin with a term on each face
  • Limited-dependence models
    • Bayesian network where each feature has at most k features as parents
    • Maximum entropy estimation
  • Limited-memory models
    • Markov models
• 38. Binary (Boolean vector)
  • Let the vocabulary size be |T|
  • Each document is a vector of length |T|
    • One slot for each term
  • Each slot t has an associated coin with head probability θ_t
  • Slots are turned on and off independently by tossing the coins
• 39. Multinomial (bag-of-words)
  • Decide the topic; topic c is picked with prior probability π(c), where Σ_c π(c) = 1
  • Each topic c has parameters θ(c, t) for terms t
  • Coin with face probabilities θ(c, t), where Σ_t θ(c, t) = 1
  • Fix the document length ℓ
  • Toss the coin ℓ times, once for each word
  • Given ℓ and c, the probability of the document [multinomial formula not recovered from slide]
• 40. Limitations
  • With the term distribution
    • The 100th occurrence is as surprising as the first
    • No inter-term dependence
  • With using the model
    • Most observed θ(c, t) are zero and/or noisy
    • Have to pick a low-noise subset of the term universe
    • Have to "fix" low-support statistics
      • Smoothing and discretization
      • A coin turned up heads 100/100 times; what is Pr(tail) on the next toss?
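Slides 39, 40, and 43 come together in a multinomial classifier with Laplace (add-one) smoothing, which fixes exactly the zero-count θ(c, t) problem above. A sketch; the toy "java" documents and class names are illustrative:

```python
import math
from collections import Counter

def train(docs_by_class):
    """Multinomial model with Laplace smoothing:
    theta(c, t) = (count(c, t) + 1) / (tokens in c + |vocab|)."""
    vocab = {t for docs in docs_by_class.values() for d in docs for t in d}
    n = sum(len(docs) for docs in docs_by_class.values())
    model = {}
    for c, docs in docs_by_class.items():
        counts = Counter(t for d in docs for t in d)
        total = sum(counts.values())
        theta = {t: (counts[t] + 1) / (total + len(vocab)) for t in vocab}
        model[c] = (math.log(len(docs) / n), theta)   # (log prior, theta)
    return model

def classify(model, doc):
    """argmax_c  log pi(c) + sum over tokens of log theta(c, t)."""
    def score(c):
        log_prior, theta = model[c]
        return log_prior + sum(math.log(theta[t]) for t in doc if t in theta)
    return max(model, key=score)

model = train({
    "language": [["java", "class", "applet"], ["java", "code"]],
    "beverage": [["java", "coffee", "cup"], ["coffee", "brew"]],
})
```

Thanks to smoothing, a term never seen in a class still gets small nonzero probability, so one unseen word cannot veto an otherwise good class.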
• 41. Feature selection
  • Pick F ⊆ T such that models built over F have high separation confidence
  [Diagram: a model with unknown parameters p_1, p_2, …, q_1, q_2, … over term set T; observed data over N documents yields confidence intervals for p_1 and q_1]
• 42. Effect of feature selection
  • Sharp knee in error with a small number of features
  • Saves class model space
    • Easier to hold in memory
    • Faster classification
  • Mild increase in error beyond the knee
    • Worse for the binary model
• 43. Effect of parameter smoothing
  • Multinomial known to be more accurate than binary under Laplace smoothing
  • A better marginal distribution model compensates for modeling term counts!
  • Good parameter smoothing is critical
• 44. Support vector machines (SVM)
  • No assumptions on data distribution
  • Goal is to find separators
  • Large bands around separators give better generalization
  • Quadratic programming
  • Efficient heuristics
  • Best known results
• 45. Maximum entropy classifiers
  • Observations (d_i, c_i), i = 1…N
  • Want a model p(c | d), expressed using features f_i(c, d) and parameters λ_j
  • Constraints given by observed data
  • Objective is to maximize the entropy of p
  • Features
    • Numerical non-linear optimization
    • No naïve independence assumptions
    • 46. Semi-supervised learning
• 47. Exploiting unlabeled documents
  • Unlabeled documents are plentiful; labeling is laborious
  • Let training documents belong to classes in a graded manner, Pr(c | d)
  • Initially, labeled documents have 0/1 membership
  • Repeat (expectation maximization, 'EM'):
    • Update class model parameters θ
    • Update membership probabilities Pr(c | d)
  • A small labeled set can give a large accuracy boost
• 48. Clustering categorical data
  • Example: web pages bookmarked by many users into multiple folders
  • Two relations
    • Occurs_in(term, document)
    • Belongs_to(document, folder)
  • Goal: cluster the documents so that the original folders can be expressed as simple unions of clusters
  • Applications: user profiling, collaborative recommendation
• 49. Bookmarks clustering
  • Unclear how to embed in a geometry
    • A folder is worth __?__ words?
  • Similarity clues: document-folder cocitation and term sharing across folders
  [Diagram: folders (Entertainment, Studios, Broadcasting, Media) and sites (kpfa.org, bbc.co.uk, kron.com, channel4.com, kcbs.com, foxmovies.com, lucasfilms.com, miramax.com) linked by shared documents, shared folders, and shared terms, revealing themes 'Radio', 'Television', 'Movies']
    • 50. Analyzing hyperlink structure
• 51. Hyperlink graph analysis
  • Hypermedia is a social network
    • Telephoned, advised, co-authored, paid
  • Social network theory (cf. Wasserman & Faust)
    • Extensive research applying graph notions
    • Centrality and prestige
    • Co-citation (relevance judgment)
  • Applications
    • Web search: HITS, Google, CLEVER
    • Classification and topic distillation
• 52. Hypertext models for classification
  • c = class, t = text, N = neighbors
  • Text-only model: Pr[t | c]
  • Using neighbors' text to judge my topic: Pr[t, t(N) | c]
  • Better model: Pr[t, c(N) | c]
  • Non-linear relaxation
• 53. Exploiting link features
  • 9600 patents from 12 classes marked by USPTO
  • Patents have text and cite other patents
  • Expand the test patent to include its neighborhood
  • 'Forget' a fraction of the neighbors' classes
• 54. Co-training
  • Divide features into two class-conditionally independent sets
  • Use labeled data to induce two separate classifiers
  • Repeat:
    • Each classifier is "most confident" about some unlabeled instances
    • These are labeled and added to the training set of the other classifier
  • Improvements for text + hyperlinks
• 55. Ranking by popularity
  • In-degree ≈ prestige
  • Not all votes are worth the same
  • Prestige of a page is the sum of the prestige of citing pages: p = Ep
  • Pre-compute a query-independent prestige score
    • The Google model
  • High prestige → good authority
  • High reflected prestige → good hub
  • Bipartite iteration
    • a = Eh
    • h = E^T a
    • h = E^T Eh
  • The HITS/Clever model
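The bipartite iteration can be sketched directly: an authority score sums the hub scores of the pages pointing at it, a hub score sums the authority scores of the pages it points to, with normalization after each step. The toy link graph is illustrative:

```python
import math

def hits(edges, iters=50):
    """HITS bipartite iteration over directed edges (u, v) meaning u links
    to v: authority(n) = sum of hubs of in-neighbors, hub(n) = sum of
    authorities of out-neighbors, each normalized to unit length."""
    nodes = {n for e in edges for n in e}
    auth = {n: 1.0 for n in nodes}
    hub = {n: 1.0 for n in nodes}
    for _ in range(iters):
        auth = {n: sum(hub[u] for u, v in edges if v == n) for n in nodes}
        hub = {n: sum(auth[v] for u, v in edges if u == n) for n in nodes}
        for scores in (auth, hub):
            norm = math.sqrt(sum(s * s for s in scores.values())) or 1.0
            for n in scores:
                scores[n] /= norm
    return hub, auth

# h1 and h2 both point at a1 and a2; h3 points only at a3
links = [("h1", "a1"), ("h1", "a2"), ("h2", "a1"), ("h2", "a2"),
         ("h3", "a3")]
hub, auth = hits(links)
# h1, h2 emerge as the strongest hubs and a1, a2 as the strongest
# authorities; the isolated h3/a3 pair is driven toward zero
```

The mutual reinforcement is visible in the scores: the densely co-cited component dominates after a few iterations, which is the HITS/Clever intuition.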
• 56. Tables and queries
  Tables: LINK(urlsrc, @ipsrc, urldst, @ipdst, wgtfwd, wgtrev), HUBS(url, score), AUTH(url, score)

  update LINK as X set (wtfwd) =
    1. / (select count(ipsrc) from LINK
          where ipsrc = X.ipsrc and urldst = X.urldst)
  where type = non-local;

  delete from HUBS;
  insert into HUBS(url, score)
    (select urlsrc, sum(score * wtrev)
     from AUTH, LINK
     where authwt is not null and type = non-local
       and ipdst <> ipsrc and url = urldst
     group by urlsrc);
  update HUBS set (score) = score / (select sum(score) from HUBS);
• 57. Topical locality on the Web
  • Sample a sequence of out-links from pages
  • Classify the out-links
  • See if the class is the same as that at offset zero
  • TFIDF similarity across the endpoints of a link is very large compared to random page-pairs
• 58. Resource discovery
  [Diagram: focused-crawler architecture: a taxonomy database with taxonomy editor and example browser; a crawl database feeding a hypertext classifier (learn and apply phases) with topic models; a topic distiller providing feedback to a scheduler driving crawler workers]
• 59. Resource discovery results
  • High rate of "harvesting" relevant pages
  • Robust to perturbations of starting URLs
  • Great resources found 12 links away from the start set
    • 60. Systems issues
• 61. Data capture
  • Early hypermedia visions
    • Xanadu (Nelson), Memex (Bush)
    • Text, links, browsing and searching actions
  • Web as hypermedia
    • Text and link support is reasonable
      • Autonomy leads to some anarchy
    • Architecture for capturing user behavior
      • No single standard
      • Applications too nascent and diverse
      • Privacy concerns
• 62. Storage, indexing, query processing
  • Storage of XML objects in an RDBMS is being intensively researched
  • Documents have unstructured fields too
  • Space- and update-efficient string indices
    • Indices in Oracle8i exceed 10x the raw text size
  • Approximate queries over text
  • Combining string queries with structure queries
  • Handling hierarchies efficiently
• 63. Concurrency and recovery
  • Strong RDBMS features
    • Useful in medium-sized crawlers
  • Not sufficiently flexible
    • Unlogged tables, columns
    • Lazy indices and concurrent work queues
  • Advanced query processing
    • Index(-ed scans) over temporary table expressions; multi-query optimization
    • Answering complex queries approximately
    • 64. Resources
• 65. Research areas
  • Modeling, representation, and manipulation
  • Approximate structure and content matching
  • Answering questions in specific domains
  • Language representation
  • Interactive refinement of ill-defined queries
  • Tracking emergent topics in a newsgroup
  • Content-based collaborative recommendation
  • Semantic prefetching and caching
• 66. Events and activities
  • Text REtrieval Conference (TREC)
    • Mature ad-hoc query and filtering tracks
    • New track for web search (2–100 GB corpus)
    • New track for question answering
  • Internet Archive
    • Accounts with access to large Web crawls
  • DIMACS special years on Networks (-2000)
    • Includes applications such as information retrieval, databases and the Web, multimedia transmission and coding, distributed and collaborative computing
  • Conferences: WWW, SIGIR, KDD, ICML, AAAI
