Intelligent Methods in Models of Text Information Retrieval: Implications for Society

Donald Kraft's presentation at InSciT2006.

1. Intelligent Methods in Models of Text Information Retrieval: Implications for Society
  • I International Conference on Multidisciplinary Information Sciences and Technologies (InSciT2006)
  • October 2006, Mérida, Spain
2. Donald H. Kraft [1]
  • Distinguished Visiting Professor, Department of Computer Science, U.S. Air Force Academy, USAFA, CO 80840, USA, [email_address]
  • Department of Computer Science, Louisiana State University, Baton Rouge, LA 70803-4020, USA
3. Societal Implications
  • Computing
    - Inclusiveness (gender, ethnic)
    - Digital Divide
  • Work
    - Environment
    - Displacement of Labor
  • Intellectual Property
    - Copyright, Patent
      - Software, Music
  • Privacy
4.
  • Security
  • Access
    - Filtering
      - Censoring Violence, Hatred, Pornography
  • Blogs (web logs) and Politics
5. Adversarial IR
  • Search engine spam (spamdexing)
    - Attempt at exploiting a ranking algorithm so that a web page may appear higher in certain search results ("optimization" – reverse engineering of the ranking algorithm)
    - Content spam
      - Hidden or invisible text (same as background color)
      - Keyword stuffing (hidden, random text for frequency)
      - Meta tag stuffing (repeating keywords in tags)
      - Gateway or doorway pages (low-quality pages with keywords)
      - Scraper sites (scrape results, create content for pages with ads)
6.
  • Link spam
    - Link farms (community of pages citing each other)
      - Exchanging reciprocal links to increase rank
    - Hidden links (hiding links where users won't see them)
    - Sybil attack (multiple identities)
    - Spam blogs (splogs)
    - Page hijacking
    - Referer log spamming
    - Buying expired domains
  • Link bombing (collection of pages with easily detectable methods of artificially increasing web ranking, to try to get a targeted site penalized or delisted)
7.
  • Advertisement Blocking
  • Web Content Filtering
  • Other Methods
    - Mirror websites (different URLs, same content)
    - Page redirects (without user intervention – scripts, refresh tags, server-side techniques)
    - Cloaking (different page seen by users than by the spider)
    - Black hat spam – doorway pages, googleating
    - Google bomb
    - Crawling the web without detection
    - Malicious tagging
8. Information Retrieval Repository/Testbed
  • A collaborative research project at Louisiana State University
    - Enhancing system usability through the application of a user-centered paradigm
    - Benefiting researchers in both the HCI and LIS communities by providing new design tools for information retrieval (IR) systems
  • Participants
    - Computer Science Department
      - Donald H. Kraft
    - School of Library and Information Science
      - Boryun Ju and Lisl Zach
9. Phase I: Constructing an IR System Repository (Testbed)
  • Constructing an IR system repository (testbed)
    - Equipment
    - Dataset
      - TREC
    - IRS
      - Smart, Inquery, MG, …
      - Reverse engineering
  • Provides access for researchers to test individual modifications without having to generate an entire system
10. Phase II: Developing a Web-based front-end system
  • User inputs for design
    - User-process model
    - Prototype – Web-based interface
    - Assessing user performance
    - Experiments: testing usability – experts vs. novices
    - Design, test, redesign; implement
  • Target population
    - HCI and LIS researchers and educators
    - MIS professionals
    - Other information stakeholders
11. Phase III: Identifying appropriate user-centered evaluation criteria
  • Identifying appropriate user-centered evaluation criteria
    - Focuses on users and their interactions with systems rather than on retrieval algorithms
    - Investigates criteria applied by users when evaluating retrieval results
    - Probes users' decisions to stop looking for information
12. Some Artificial Intelligence Techniques We Have Used For Retrieval
  • Fuzzy Sets
  • Genetic Algorithms
  • Rough Sets
13. Generalization of the Boolean Model - Adding Imprecision (Fuzzy)
  • F (document term weights) and a (query term weights) map to [0,1] rather than to {0,1}
  • Imprecision (Fuzzy)
    - Document representation
    - Query representation
    - Matching
    - Relevance
    - Ranking
  • Fuzzy ≠ Probability
    - Possibility
14. Generalizations - Adding Boolean Logic
  • Vector Space Model
    - Query t AND t': Sim(d,q) = 1 - { [a_qt^p (1 - F_dt)^p + a_qt'^p (1 - F_dt')^p] / [a_qt^p + a_qt'^p] }^(1/p)
    - Query t OR t': Sim(d,q) = { [a_qt^p F_dt^p + a_qt'^p F_dt'^p] / [a_qt^p + a_qt'^p] }^(1/p)
  • Probability Model
    - Query t AND t': Pr(d relevant | t and t') = Pr(d relevant | t) * Pr(d relevant | t')
    - Query t OR t': Pr(d relevant | t or t') = 1 - (1 - Pr(d relevant | t)) * (1 - Pr(d relevant | t'))
  (A short Python sketch of these combinations follows.)
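
As a quick illustration of the soft-Boolean combinations above, here is a minimal Python sketch; the function names, the default p = 2, and the example weights are illustrative assumptions, not part of the slide.

    def pnorm_and(a_t, a_tp, f_t, f_tp, p=2.0):
        """Sim(d, q) for the query 't AND t'' under the p-norm style combination."""
        num = (a_t ** p) * ((1 - f_t) ** p) + (a_tp ** p) * ((1 - f_tp) ** p)
        den = a_t ** p + a_tp ** p
        return 1 - (num / den) ** (1 / p)

    def pnorm_or(a_t, a_tp, f_t, f_tp, p=2.0):
        """Sim(d, q) for the query 't OR t'' under the p-norm style combination."""
        num = (a_t ** p) * (f_t ** p) + (a_tp ** p) * (f_tp ** p)
        den = a_t ** p + a_tp ** p
        return (num / den) ** (1 / p)

    if __name__ == "__main__":
        # Document weights F_dt = 0.8 and F_dt' = 0.3, with equal query weights a = 1.
        print(pnorm_and(1.0, 1.0, 0.8, 0.3))  # ~0.49; tends toward min(F_dt, F_dt') as p grows
        print(pnorm_or(1.0, 1.0, 0.8, 0.3))   # ~0.60; tends toward max(F_dt, F_dt') as p grows
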
15. Relevance Feedback
  • Modify, usually the query, based on the user's relevance judgments after seeing some retrieved documents
  • Vector Space Model - move the query closer to relevant documents and away from nonrelevant documents:
    q' = q + β Σ_{i ∈ R} d_i - (1 - β) Σ_{i ∈ R'} d_i, where R = relevant retrieved set, R' = nonrelevant retrieved set
  • Probability Model - Bayesian updating:
    q'(j) = weight for term j = [r_j / (r - r_j)] / [(n_j - r_j) / ((n - n_j) - (r - r_j))],
    where n = number of retrieved documents, r = number of relevant documents retrieved, n_j = number of retrieved documents with term j, and r_j = number of relevant documents retrieved with term j
  (Both updates are sketched in Python below.)
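
A minimal sketch of the two feedback updates above, assuming a simple list-of-weights vector representation; the function names, the β default, the negative-weight clipping, and the 0.5 smoothing constant are my assumptions.

    def vector_feedback(q, relevant, nonrelevant, beta=0.75):
        """q' = q + beta * sum of relevant doc vectors - (1 - beta) * sum of nonrelevant ones."""
        q_new = list(q)
        for d in relevant:
            for j in range(len(q_new)):
                q_new[j] += beta * d[j]
        for d in nonrelevant:
            for j in range(len(q_new)):
                q_new[j] -= (1 - beta) * d[j]
        return [max(w, 0.0) for w in q_new]  # clip negative weights (a common convention)

    def probabilistic_weight(n, r, n_j, r_j, eps=0.5):
        """Bayesian-updating weight for term j; eps smooths zero counts (an assumption)."""
        num = (r_j + eps) / ((r - r_j) + eps)
        den = ((n_j - r_j) + eps) / (((n - n_j) - (r - r_j)) + eps)
        return num / den

    if __name__ == "__main__":
        q = [1.0, 0.0, 0.5]          # query over a 3-term vocabulary
        rel = [[0.9, 0.2, 0.4]]      # documents judged relevant
        nonrel = [[0.1, 0.8, 0.0]]   # documents judged nonrelevant
        print(vector_feedback(q, rel, nonrel))
        print(probabilistic_weight(n=20, r=8, n_j=10, r_j=6))
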
16. Boolean Relevance Feedback
  • Common Sense
    - Increase Recall - use OR, use general terms, search the entire document, use wild-card truncation
    - Increase Precision - use AND, use specific terms, search specific fields, use delimiters (language, author, source, date), use proximity operators
  • Fuzzy Boolean Queries
    - Search the entire space of possible queries - satisfiability
      - Linear search, discrete optimization, tabu search, simulated annealing, ...
17. Genetic Algorithm (Programming)
  • Initial generation - initial population of queries (original query, terms in the original query in easy combinations, ...)
  • Evaluate individuals in the population - fitness function (relevance and precision for known items)
  • Generate new queries by combining randomly selected (based on fitness) old ones - crossover
    - Query represented as a parse tree - exchange subtrees
  • Possibly modify some queries via mutation (change weights, negate terms, change AND and OR operators)
  • Continue until the evaluation is stable, or satisfactory, or …
  • Niching
  (A toy end-to-end sketch of this loop appears below.)
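
The following is a toy, end-to-end sketch of such a loop, assuming queries represented as AND/OR parse trees over a tiny invented corpus; the F1 fitness function, the crossover and mutation operators, and all parameter values are illustrative choices rather than the specific algorithm reported on the slide.

    import random

    DOCS = {1: {"fuzzy", "retrieval"}, 2: {"fuzzy", "logic"},
            3: {"genetic", "retrieval"}, 4: {"music"}}
    RELEVANT = {1, 3}
    TERMS = ["fuzzy", "retrieval", "genetic", "logic", "music"]

    def matches(tree, doc_terms):
        """Evaluate a Boolean query parse tree against a document's term set."""
        if isinstance(tree, str):
            return tree in doc_terms
        op, left, right = tree
        l, r = matches(left, doc_terms), matches(right, doc_terms)
        return (l and r) if op == "AND" else (l or r)

    def fitness(tree):
        """F1 of the retrieved set against the known relevant documents."""
        retrieved = {d for d, terms in DOCS.items() if matches(tree, terms)}
        hits = len(retrieved & RELEVANT)
        if not retrieved or not hits:
            return 0.0
        precision, recall = hits / len(retrieved), hits / len(RELEVANT)
        return 2 * precision * recall / (precision + recall)

    def random_tree(depth=2):
        if depth == 0 or random.random() < 0.3:
            return random.choice(TERMS)
        return (random.choice(["AND", "OR"]), random_tree(depth - 1), random_tree(depth - 1))

    def crossover(a, b):
        """Exchange subtrees: graft one randomly chosen side of b onto a."""
        if isinstance(a, str) or isinstance(b, str):
            return a if random.random() < 0.5 else b
        side = random.choice([1, 2])
        new = list(a)
        new[side] = b[side]
        return tuple(new)

    def mutate(tree, rate=0.2):
        """Occasionally swap a term or flip an AND/OR operator."""
        if isinstance(tree, str):
            return random.choice(TERMS) if random.random() < rate else tree
        op = ("OR" if tree[0] == "AND" else "AND") if random.random() < rate else tree[0]
        return (op, mutate(tree[1], rate), mutate(tree[2], rate))

    def evolve(generations=30, pop_size=20):
        """Plain generational GA; the niching mentioned on the slide is omitted for brevity."""
        population = [random_tree() for _ in range(pop_size)]
        for _ in range(generations):
            population.sort(key=fitness, reverse=True)
            parents = population[: pop_size // 2]   # fitness-based (truncation) selection
            children = [mutate(crossover(random.choice(parents), random.choice(parents)))
                        for _ in range(pop_size - len(parents))]
            population = parents + children
        return max(population, key=fitness)

    if __name__ == "__main__":
        random.seed(0)
        best = evolve()
        print(best, fitness(best))
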
18. Fuzzy and Rough Sets for Data Mining of a Controlled Vocabulary
  • U = universe of objects, e.g., a set of terms from a controlled vocabulary
  • R = partition, i.e., a reflexive, symmetric, and transitive (also serial and Euclidean) relationship
  • R(x,x) = 1, R(x,y) = R(y,x), R(x,y) ∧ R(y,z) ⇒ R(x,z)
  • R-foreset, e.g., synonymy
  • U/R = set of n partition cells (equivalence classes) = {C_1, C_2, …, C_n}, e.g., synonyms
  • S = any subset of U, i.e., S ⊆ U, e.g., a document or query
  • apr_R(S) = {x ∈ C_i | C_i ⊆ S} = lower bound approximation =
    set of partition-cell elements where every element of the cell is in S
  • apr̄_R(S) = {x ∈ C_i | C_i ∩ S ≠ ∅} = upper bound approximation =
    set of partition-cell elements where at least one element of the cell is in S
  (A small Python sketch of these approximations follows.)
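
A minimal sketch of the two approximations for a crisp partition of terms; the function names and the synonym-class example are my assumptions.

    def lower_approx(partition, s):
        """Union of the cells C_i that are entirely contained in S."""
        return set().union(*(c for c in partition if c <= s))

    def upper_approx(partition, s):
        """Union of the cells C_i that intersect S."""
        return set().union(*(c for c in partition if c & s))

    if __name__ == "__main__":
        partition = [{"car", "automobile"}, {"truck", "lorry"}, {"bicycle"}]  # synonym classes
        s = {"car", "automobile", "truck"}                                    # terms of a document
        print(lower_approx(partition, s))   # {'car', 'automobile'}
        print(upper_approx(partition, s))   # {'car', 'automobile', 'truck', 'lorry'}
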
19. Term relationships
  • Term Relationships
    - Fuzzy relationships - fuzzy thesauri and inference
    - Example from UMLS (medical): 1,2-dipalmitoylphosphatidycholine
      - R1 - Synonym: dipalmitoyllecithin
      - R2 - Ancestor term: phosphatidic acids; phospholipids
      - R3 - Parent term: dimyristoylphosphatidylcholine
      - R4 - Sibling term: phosphatidycholines
      - R5 - Qualifier term: administration and dosage; adverse effects; analogs and derivatives
      - R6 - Narrower term: colfosceril palmitate
      - R7 - Related term: dipalmitoylphosphatidyl; mass concentration; point in time; serum; quantitative
      - R8 - Co-occurring term: acetephenones; acids; alcohols; laurates
20.
  • Sim_R(D,Q) = 1 - | apr_R(Q) - apr_R(Q) ∩ apr_R(D) | / | apr_R(Q) |   (lower approximations)
  • Sim̄_R(D,Q) = 1 - | apr̄_R(Q) - apr̄_R(Q) ∩ apr̄_R(D) | / | apr̄_R(Q) |   (upper approximations)
  • Fuzzy set notation
    - μ_apr_R(S)(x) = inf{ μ_S(y) | y ∈ U, xRy } = inf{ 1 - μ_R(x,y) | y ∉ S } = inf{ max[μ_S(y), 1 - μ_R(x,y)] | y ∈ U }
    - μ_apr̄_R(S)(x) = sup{ μ_S(y) | y ∈ U, xRy } = sup{ μ_R(x,y) | y ∈ S } = sup{ min[μ_S(y), μ_R(x,y)] | y ∈ U }
    - Note that |X| = Σ_{y ∈ U} μ_X(y) = cardinality of set X
  (A Python sketch of the Sim_R computation follows.)
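
A small sketch of the Sim_R formula above, taking already-computed approximation sets as plain Python sets; the function name, the empty-query convention, and the example sets are illustrative assumptions.

    def rough_similarity(apr_q, apr_d):
        """Sim_R(D, Q) = 1 - |apr(Q) - apr(Q) ∩ apr(D)| / |apr(Q)| (lower or upper approximations)."""
        if not apr_q:
            return 1.0  # convention for an empty query approximation (my assumption)
        uncovered = apr_q - (apr_q & apr_d)
        return 1 - len(uncovered) / len(apr_q)

    if __name__ == "__main__":
        apr_q = {"car", "automobile", "truck", "lorry"}   # e.g., upper approximation of a query
        apr_d = {"car", "automobile", "bicycle"}          # e.g., upper approximation of a document
        print(rough_similarity(apr_q, apr_d))             # 0.5: half of the expanded query is covered
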
21.
  • Rough Fuzzy Sets
    - Crisp partition, fuzzy documents and query
      - Alpha cutoff
  • Fuzzy Rough Sets
    - Fuzzy partition, crisp documents and query
  • Generalized Fuzzy and Rough Sets
    - Fuzzy partition, documents, and query
  • Nonequivalence Relationships
    - Hierarchy (weak ordering)
    - Isa - dog isa canine isa mammal …
  • Combining Several Generalized Relationships Simultaneously
22. Extensions
  • Note that synonym, sibling, and related term relationships are partitions
  • Note that ancestor, parent, and narrower term relationships are partial orderings (isa, i.e., transitive but neither reflexive nor symmetric)
  • Note that qualifier and co-occurring term relationships are general, without much lattice structure
  • Note - one could just naively use the formulae without worrying about the nature of the relationship
23. Fuzzy Clustering
  • Clustering - putting like things together
  • Documents - term-weight vectors; similarity via the cosine (vector space model) between documents rather than between document and query
    - The dimension of the similarity calculation is the number of terms in the vocabulary
  (A cosine-similarity sketch follows.)
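
For concreteness, the document-to-document cosine similarity over term-weight vectors might look like the following sketch; the function name and the example vectors are illustrative.

    import math

    def cosine(u, v):
        """Cosine similarity of two equal-length term-weight vectors."""
        dot = sum(a * b for a, b in zip(u, v))
        norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        return dot / norm if norm else 0.0

    if __name__ == "__main__":
        d1 = [0.8, 0.0, 0.3]   # term weights over a 3-term vocabulary
        d2 = [0.6, 0.1, 0.4]
        print(cosine(d1, d2))  # close to 1.0: the two documents would likely share a cluster
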
24. Fuzzy Rule Discovery
  • Consider the documents in a cluster and look at the term weights in the centroid of that cluster
  • Generate rules of the form "the weight of term j exceeding a threshold w implies that the weight of term k must exceed a threshold w'", i.e.,
    r: (t_j ≥ w) ⇒ (t_k ≥ w')
  • Use these rules to expand the query by adding additional terms with weights, to increase performance (precision)
25.
  • Normalize the cluster centroid v_k
  • Sort the terms in descending order of their weights in the centroid and consider the first M (≥ 2) terms
  • Build term pairs from the chosen terms in each cluster center, in the form ⟨[t_j, w], [t_k, w']⟩
  • Merge multiple occurrences of the same pair with different weights (use the minimums)
  • Build two rules from each pair above:
    [t_j ≥ w] ⇒ [t_k ≥ w'] and [t_k ≥ w'] ⇒ [t_j ≥ w]
  • Each fuzzy rule r is a well-formed formula in fuzzy logic (Kundu and Chen)
26.
  • Use the rules above to modify a query q:
    q' = q with the weight of term k modified to be w' if the weight of term k < w' but the weight of term j ≥ w
  • Repeatedly apply the fuzzy rules until no more applicable rules can be found - note that the final modified query is independent of the order of rule applications
  (A Python sketch of rule building and application follows.)
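
A compact sketch of slides 24–26, assuming centroids and queries stored as term → weight dictionaries; the function names, M = 3, and the example centroid (echoing the Yellowstone example on the next slide) are my choices, not the exact procedure from the deck.

    def build_rules(centroids, m=3):
        """centroids: list of dicts term -> weight. Returns rules as ((t_j, w), (t_k, w_prime))."""
        rules = {}
        for centroid in centroids:
            top = sorted(centroid.items(), key=lambda kv: kv[1], reverse=True)[:m]
            for i, (tj, wj) in enumerate(top):
                for tk, wk in top[i + 1:]:
                    for (a, wa), (b, wb) in (((tj, wj), (tk, wk)), ((tk, wk), (tj, wj))):
                        if (a, b) in rules:
                            owa, owb = rules[(a, b)]
                            rules[(a, b)] = (min(owa, wa), min(owb, wb))  # merge duplicates with minimums
                        else:
                            rules[(a, b)] = (wa, wb)
        return [((a, wa), (b, wb)) for (a, b), (wa, wb) in rules.items()]

    def apply_rules(query, rules):
        """Raise term weights until no rule (t_j >= w) => (t_k >= w') fires; result is order-independent."""
        q = dict(query)
        changed = True
        while changed:
            changed = False
            for (tj, w), (tk, w_prime) in rules:
                if q.get(tj, 0.0) >= w and q.get(tk, 0.0) < w_prime:
                    q[tk] = w_prime
                    changed = True
        return q

    if __name__ == "__main__":
        centroids = [{"geyser": 0.87, "faithful": 0.65, "spring": 0.37, "lodge": 0.10}]
        rules = build_rules(centroids, m=3)
        print(apply_rules({"geyser": 0.9}, rules))  # expands the query with faithful (0.65) and spring (0.37)
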
27. User Profiles Via Fuzzy Clustering
  • User profiles - "average" user descriptions based on demographics and past/current searches
  • Built by clustering the web pages retrieved by users
  • Web pages are represented as vectors of term weights based on frequency (like ordinary text retrieval)
  • Apply fuzzy C-means to both the "relevant" and "nonrelevant" web pages labeled by a user via relevance feedback
  • Example:
    - Keyword search for "yellowstone" on the Web
    - Experiments with about 100 Web pages find 60% to be relevant
    - Relevant sites have terms with weights:
      <buffalo = 0.7, deer = 0.65, wolf = 0.45, ...>
      <geyser = 0.87, faithful = 0.65, spring = 0.37, ...>
      <reserve = 0.64, lodge = 0.74, bus = 0.31, ...>
    - Nonrelevant sites have terms with weights:
      <hike = 0.61, trail = 0.59, country = 0.74, ...>
      <flyfish = 0.82, boat = 0.54, lake = 0.45, …>
28.
  • The fuzzy cluster center vectors form a user profile
    - Relevant cluster centroids: V = {v_1, ..., v_r}
    - Nonrelevant cluster centroids: V' = {v'_1, ..., v'_r'}
  • A web site's relevance is predicted using the profile
  • Prediction accuracy ranges from 72% to 89%, depending on training data size
  (A minimal prediction sketch follows.)
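
The slide does not spell out the prediction step; one plausible reading, sketched below under that assumption, is to compare a new page's term vector against the relevant and nonrelevant centroids by cosine similarity and take the closer side. All names and numbers are illustrative (the centroids echo the Yellowstone example).

    import math

    def cosine(u, v):
        """Cosine similarity of two sparse term -> weight dictionaries."""
        terms = set(u) | set(v)
        dot = sum(u.get(t, 0.0) * v.get(t, 0.0) for t in terms)
        nu = math.sqrt(sum(w * w for w in u.values()))
        nv = math.sqrt(sum(w * w for w in v.values()))
        return dot / (nu * nv) if nu and nv else 0.0

    def predict_relevant(page, relevant_centroids, nonrelevant_centroids):
        """Predict relevance by the closest profile centroid (an assumed decision rule)."""
        best_rel = max(cosine(page, c) for c in relevant_centroids)
        best_non = max(cosine(page, c) for c in nonrelevant_centroids)
        return best_rel >= best_non

    if __name__ == "__main__":
        V = [{"buffalo": 0.7, "deer": 0.65, "wolf": 0.45},
             {"geyser": 0.87, "faithful": 0.65, "spring": 0.37}]
        V_prime = [{"hike": 0.61, "trail": 0.59, "country": 0.74}]
        page = {"geyser": 0.9, "spring": 0.4, "trail": 0.1}
        print(predict_relevant(page, V, V_prime))   # True: closer to a relevant centroid
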
29. Some Issues - Document Duplication
  • Problem
    - Union of documents from multiple sources (search engines)
    - Affects both effectiveness and efficiency
  • Types
    - Syntactic (textual similarities)
    - Semantic (meaning similar even if not syntactically so)
30.
  • Solutions
    - Syntactic
      - Hash-based approaches
        - Not resilient to small changes in document representation
      - IR techniques (similarity)
        - Clustering - rank all documents with similar term vectors (equivalent weights mean duplicates)
        - Comparing corresponding posting entries of retrieved documents - high I/O cost, slow
31.
  • Resemblance R(D_1, D_2) = |S(D_1) ∩ S(D_2)| / |S(D_1) ∪ S(D_2)|
  • Calculate the resemblance of all document pairs with matching terms
  • Divide each document into shingles (X contiguous terms) and hash each shingle to a unique value
  • Recalculate resemblance based on the hashes rather than the terms
  • Optimization - filter shingles (keep every k-th one, or combine multiple shingles)
  (A small shingling sketch follows.)
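
A compact sketch of shingle-based resemblance, assuming word-level shingles of X = 3 and Python's built-in hash as a stand-in for a real fingerprinting function; names and the example texts are illustrative.

    def shingles(text, x=3):
        """Hash every run of x contiguous terms in the document."""
        words = text.lower().split()
        return {hash(" ".join(words[i:i + x])) for i in range(max(len(words) - x + 1, 1))}

    def resemblance(doc1, doc2, x=3):
        """Jaccard ratio of the two documents' shingle-hash sets."""
        s1, s2 = shingles(doc1, x), shingles(doc2, x)
        return len(s1 & s2) / len(s1 | s2) if s1 | s2 else 0.0

    if __name__ == "__main__":
        a = "fuzzy sets for intelligent text retrieval models"
        b = "fuzzy sets for intelligent text retrieval systems"
        print(resemblance(a, b))   # ~0.67: a near-duplicate pair
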
32. Future Directions
  • Nonprint media - images (content-based), sound (and music), multimedia
  • Digital Libraries and Portals
  • Parallel/Distributed Retrieval Systems
  • Web retrieval
  • Natural Language Processing
    - Summarization
  • Cross-Language Retrieval
    - English: "the spirit is willing but the flesh is weak"
    - AltaVista Babel Fish Spanish: "el alcohol está dispuesto pero la carne es débil"
    - AltaVista Babel Fish back to English: "the alcohol is arranged but the meat is weak"
  • Filtering
    - Recommender Systems
33.
  • Machine Learning
    - Probability, Symbolic Learning and Rule Induction, Neural Networks, Evolutionary Computing, Analytical Learning, …
    - Relevance Feedback, Information Extraction, Filtering, Recommender Systems, Text Classification and Clustering
  • Text Mining for Web Documents
  • Web Structure Mining
    - Hyperlink Structure
  • Intelligent Spiders
  • Web Agents - Collaboration
  • Visualization
34. ¿Preguntas? (Questions?)
