  1. Information Retrieval (IR)
     Mohammed Al-Mashraee
     Corporate Semantic Web (AG-CSW)
     Institute for Computer Science, Freie Universität Berlin
     almashraee@inf.fu-berlin.de
     AG Corporate Semantic Web, Freie Universität Berlin
     http://www.inf.fu-berlin.de/groups/ag-csw/
  2. Agenda
     • Introduction
     • Motivation
     • Data structures and general representations
     • IR definition
     • IR models
       - Set theoretic / Boolean
       - Weighting methods
       - Algebraic / vector
     • IR evaluation
  3. Introduction: IR System
     Document corpus + query string → IR system → ranked documents (1. Doc1, 2. Doc2, 3. Doc3, ...)
  4. Introduction: IR Tasks
     Given:
     • A corpus of textual natural-language documents.
     • A user query in the form of a textual string.
     Find: a ranked set of documents that are relevant to the query.
  5. Introduction: Motivation
     These days we frequently think first of web search, but there are many other cases:
     • E-mail search
     • Searching your laptop
     • Corporate knowledge bases
     • Legal information retrieval
  6. Introduction: Motivation
     Unstructured (text) vs. structured (database) data in the mid-nineties [chart]
  7. Introduction: Motivation
     Unstructured (text) vs. structured (database) data today [chart]
  8. Data structures and general representations
  9. Data structures and representations [Almashraee, 2013]
     Data representation is the process of organizing data so that it can be accessed and manipulated quickly. There are different structures:
     • Database schema structure
     • Semi-structured data
     • Semantic representation / RDF representation
     • Feature vector representation
  10. Data structures and representations
      Database schema structure
      • A structured data representation.
      • Provides smooth access to and manipulation of the data stored in its schema, e.g., Oracle, MySQL.
      Semi-structured data representation
      • Allows direct storage and manipulation of data, but limited querying ability, e.g., XML.
  11. Data structures and representations
      Semantic representation / RDF representation
      • A newly available, structured data representation.
      • Typically used by Semantic Web applications to interpret related information and store it in triple (subject, predicate, object) format.
  12. Data structures and representations
      Feature vector representation
      • The most commonly used representation, in which extracted features are presented as a vector.
      • This representation allows different methods (such as information retrieval, support vector machines, Naïve Bayes, association rule mining, decision trees, hidden Markov models, maximum entropy models, etc.) to build useful models for solving related problems.
  13. IR vs. databases: structured vs. unstructured data
      Structured data tends to refer to information in "tables":

      Employee   Manager   Salary
      Smith      Jones     50000
      Chang      Smith     60000
      Ivy        Smith     50000

      Typically allows numerical-range and exact-match (for text) queries, e.g.,
      Salary < 60000 AND Manager = Smith
  14. Unstructured data
      • Typically refers to free text
      • Allows:
        - Keyword queries, including operators
        - More sophisticated "concept" queries, e.g., find all web pages dealing with drug abuse
      • The classic model for searching text documents
  15. More generally
      • Structured vs. unstructured data
      • Search vs. discovery
  16. Definition [Manning et al., 2008]
      Information Retrieval (IR): finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).
  17. Representation of documents [Paschke notes]
      • Set of terms T = {t1, ..., tn}
      • Each document dj is represented as a vector of weighted terms: dj = (w1,j, ..., wn,j)
      • wi,j is the weight of term ti in document dj
      • Set of documents D
      • A similarity measure sim describes the similarity of a document to the query
  18. IR Models
  19. IR Models
      • Boolean / set theoretic models
      • Vector space models (statistical/algebraic)
      • Probabilistic models
  20. Boolean models / set theoretic
  21. Set theoretic / Boolean retrieval
      • Documents are represented as vectors of index terms
        - True if the term exists in the document, false otherwise
        - Weight wi,j is 0 or 1 (Boolean truth weight)
        - Interpreted as Boolean variables
      • Queries are represented as Boolean expressions
        - Terms are queries
        - (q1 AND q2), (q1 OR q2), (NOT q1) are queries
      • A document is relevant if the query expression and the document expression together are true
      • The similarity measure is also Boolean
  22. Example 1: Set theoretic / Boolean retrieval
      T = ("today", "is", "Monday", "lecture", "no")
      Documents: d1: "today is Monday", d2: "today is lecture", d3: "Monday is lecture"

            today  is  Monday  lecture  no
      d1      1    1     1       0      0
      d2      1    1     0       1      0
      d3      0    1     1       1      0

      q     is   Monday AND lecture   today OR Monday   NOT lecture
      d1     1            0                  1                1
      d2     1            0                  1                0
      d3     1            1                  1                0
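The Boolean evaluation above can be sketched in a few lines of Python. The term set and documents are the ones from the slide; the `index` structure and the `AND`/`OR`/`NOT` helpers are illustrative names, not part of the original deck:

```python
# Minimal Boolean retrieval sketch over the slide's three documents.
docs = {
    "d1": "today is Monday",
    "d2": "today is lecture",
    "d3": "Monday is lecture",
}

# Binary incidence: term -> set of documents containing it.
index = {}
for doc_id, text in docs.items():
    for term in text.split():
        index.setdefault(term, set()).add(doc_id)

all_docs = set(docs)

# Boolean operators map directly to set operations.
def AND(a, b): return a & b
def OR(a, b):  return a | b
def NOT(a):    return all_docs - a

print(sorted(index["is"]))                             # ['d1', 'd2', 'd3']
print(sorted(AND(index["Monday"], index["lecture"])))  # ['d3']
print(sorted(OR(index["today"], index["Monday"])))     # ['d1', 'd2', 'd3']
print(sorted(NOT(index["lecture"])))                   # ['d1']
```

The printed results match the query columns of the table above.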
  23. Example 2: Set theoretic / Boolean retrieval
      Binary incidence matrix:

      term       Antony and  Julius  The      Hamlet  Othello  Macbeth
                 Cleopatra   Caesar  Tempest
      Antony         1         1       0        0       0        1
      Brutus         1         1       0        1       0        0
      Caesar         1         1       0        1       1        1
      Calpurnia      0         1       0        0       0        0
      Cleopatra      1         0       0        0       0        0
      mercy          1         0       1        1       1        1
      worser         1         0       1        1       1        0
  24. Example 2: Set theoretic / Boolean retrieval
      Binary incidence matrix (as on the previous slide)
      • A document is represented by a binary vector ∈ {0, 1}^|V|
      • The size of the vector depends on the size of the vocabulary (dictionary)
  25. Weighting Schemes
  26. Weighting Schemes
      Weighting schemes are used to score documents with respect to a particular query, in order to rank the documents returned:
      • Bag-of-words model
      • Term frequency (tf)
      • Document frequency (df)
      • Inverse document frequency (idf)
      • Term frequency - inverse document frequency (tf-idf)
  27. Weighting Schemes
      Term-document count matrices: consider the number of occurrences of a term in a document.

      term       Antony and  Julius  The      Hamlet  Othello  Macbeth
                 Cleopatra   Caesar  Tempest
      Antony        157        73      0        0       0        0
      Brutus          4       157      0        1       0        0
      Caesar        232       227      0        2       1        1
      Calpurnia       0        10      0        0       0        0
      Cleopatra      57         0      0        0       0        0
      mercy           2         0      3        5       5        1
      worser          2         0      1        1       1        0

      • A document is represented by a vector of natural numbers ∈ N^|V|
      • The size of the vector depends on the size of the vocabulary (dictionary)
  28. Bag-of-words model
      • The vector representation doesn't consider the ordering of words in a document
      • E.g., "John is quicker than Mary" and "Mary is quicker than John" have the same vectors
  29. Term frequency (tf) weighting
      • The term frequency tf(t,d) of term t in document d is defined as the number of times t occurs in d.
      • tf is used to compute query-document match scores.
      • Raw term frequency is not what we want:
        - A document with 10 occurrences of a term is more relevant than a document with 1 occurrence.
        - But not 10 times more relevant.
      • Relevance does not increase proportionally with term frequency (it goes up, but not linearly).
      • "Frequency" in IR denotes the count of a word in a document.
  30. Log-frequency weighting
      • The log-frequency weight of term t in document d is:
        w(t,d) = 1 + log10(tf(t,d))   if tf(t,d) > 0
        w(t,d) = 0                    otherwise
      • 0 → 0, 1 → 1, 2 → 1.3, 10 → 2, 1000 → 4, etc.
      • Score for a document-query pair: sum over terms t appearing in both query q and document d:
        score(q,d) = Σ over t ∈ q ∩ d of (1 + log10 tf(t,d))
      • The score is 0 if none of the query terms is present in the document.
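The mapping on this slide (0 → 0, 1 → 1, 2 → 1.3, 10 → 2, 1000 → 4) is easy to check directly; this short Python sketch also includes an illustrative `score` helper for the query-document sum (the function names are not from the deck):

```python
import math

def log_tf_weight(tf):
    """Log-frequency weight: 1 + log10(tf) for tf > 0, else 0."""
    return 1 + math.log10(tf) if tf > 0 else 0.0

# Reproduces the slide's mapping: 0->0, 1->1, 2->1.3, 10->2, 1000->4.
for tf in (0, 1, 2, 10, 1000):
    print(tf, round(log_tf_weight(tf), 1))

def score(query_terms, doc_tf):
    """Sum of log-tf weights over terms shared by query and document."""
    return sum(log_tf_weight(doc_tf[t]) for t in query_terms if t in doc_tf)

# A query term absent from the document contributes nothing.
print(score(["caesar", "calpurnia"], {"caesar": 10}))  # 2.0
```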
  31. Inverse document frequency (idf)
      • Another score used for ranking the matches of documents to a query.
      • Idea: terms that appear in many different documents are less indicative of the overall topic.
      • Rare terms are more informative than frequent terms.
      • df(t) is the document frequency of t: the number of documents that contain t.
        - df(t) is an inverse measure of the informativeness of t
        - df(t) ≤ N
      • We use log(N/df(t)) instead of N/df(t) to "dampen" the effect of idf.
  32. Inverse document frequency (idf)
      • We define the idf (inverse document frequency) of t by:
        idf(t) = log10(N / df(t))
      • df(t) = document frequency of term t = number of documents containing t
      • N = total number of documents in the collection
  33. Inverse document frequency (idf)
      Example: suppose N = 1,000,000 (total number of documents); idf(t) = log10(N / df(t))

      term           df(t)   idf(t)
      calpurnia          1     6
      animal           100     4
      sunday         1,000     3
      fly           10,000     2
      under        100,000     1
      the        1,000,000     0
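The idf column of this table follows directly from the definition on the previous slide; a quick Python check (the `idf` helper name is illustrative):

```python
import math

N = 1_000_000  # total number of documents, as on the slide

def idf(df, n=N):
    """Inverse document frequency: log10(N / df)."""
    return math.log10(n / df)

# Reproduces the slide's table: idf drops as the term gets more common.
for term, df in [("calpurnia", 1), ("animal", 100), ("sunday", 1_000),
                 ("fly", 10_000), ("under", 100_000), ("the", 1_000_000)]:
    print(f"{term:10s} df={df:>9,d}  idf={idf(df):.0f}")
```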
  34. Tf-idf weighting
      • The tf-idf weight of a term is the product of its tf weight and its idf weight:
        w(t,d) = (1 + log10 tf(t,d)) × log10(N / df(t))
        score(q,d) = Σ over t ∈ q ∩ d of tf-idf(t,d)
      • A term occurring frequently in the document but rarely in the rest of the collection is given high weight.
      • Experimentally, tf-idf has been found to work well; it is the best-known weighting scheme in information retrieval.
      • The weight increases with the number of occurrences within a document.
      • It increases with the rarity of the term in the collection.
  35. Tf-idf weighting
      Binary → count → weight matrix:

      term       Antony and  Julius  The      Hamlet  Othello  Macbeth
                 Cleopatra   Caesar  Tempest
      Antony        5.25      3.18     0        0       0       0.35
      Brutus        1.21      6.1      0        1       0       0
      Caesar        8.59      2.54     0        1.51    0.25    0
      Calpurnia     0         1.54     0        0       0       0
      Cleopatra     2.85      0        0        0       0       0
      mercy         1.51      0        1.9      0.12    5.25    0.88
      worser        1.37      0        0.11     4.15    0.25    1.95

      Each document is now represented by a real-valued vector of tf-idf weights ∈ R^|V|
  36. Tf-idf weighting
      Example: given a document containing terms with frequencies A(3), B(2), C(1).
      Assume the collection contains 10,000 documents, and the document frequencies of these terms are A(50), B(1300), C(250).
      (In this example, tf is normalized by the maximum term frequency in the document, and idf uses log2.)
      A: tf = 3/3; idf = log2(10000/50)   = 7.6; tf-idf = 7.6
      B: tf = 2/3; idf = log2(10000/1300) = 2.9; tf-idf = 2.0
      C: tf = 1/3; idf = log2(10000/250)  = 5.3; tf-idf = 1.8
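The numbers in this worked example can be reproduced in Python. Note the sketch follows this slide's variant (max-tf normalization, log2-based idf), not the log10 formula from slide 34; the `tf_idf` helper is an illustrative name:

```python
import math

N = 10_000  # collection size from the slide

def tf_idf(tf, max_tf, df, n=N):
    """Max-normalized tf times log2-based idf, as in the slide's example."""
    return (tf / max_tf) * math.log2(n / df)

freqs = {"A": (3, 50), "B": (2, 1300), "C": (1, 250)}  # term -> (tf, df)
max_tf = max(tf for tf, _ in freqs.values())           # 3

for term, (tf, df) in freqs.items():
    print(term, round(tf_idf(tf, max_tf, df), 1))
# A 7.6
# B 2.0
# C 1.8
```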
  37. Vector space model
  38. Vector space model
      • Documents are represented as vectors
      • Queries are also represented as vectors
      • So we have a |V|-dimensional vector space
      • Terms are axes of the space
      • Documents are points or vectors in this space
      Example:
      D1 = 2T1 + 3T2 + 5T3
      D2 = 3T1 + 7T2 + 1T3
      Q  = 0T1 + 0T2 + 2T3
  39. Vector space model
      How to measure the similarity between Di and Q?
  40. Vector space model: similarity measure
      • Documents are ranked according to their proximity (similarity) to the query in the given space.
      • A similarity measure is a function that computes the degree of similarity between two vectors:
        - Scalar product (inner product)
        - Cosine measure
  41. Vector space model: inner product
      The similarity between the vectors of document dj and query q can be computed as the vector inner product (dot product):
        sim(dj, q) = dj · q = Σ over i = 1..t of w(i,j) × w(i,q)
      where w(i,j) is the weight of term i in document j and w(i,q) is the weight of term i in the query.
      Example:
      D1 = 2T1 + 3T2 + 5T3
      D2 = 3T1 + 7T2 + 1T3
      Q  = 0T1 + 0T2 + 2T3
      sim(D1, Q) = 2·0 + 3·0 + 5·2 = 10
      sim(D2, Q) = 3·0 + 7·0 + 1·2 = 2
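The inner-product example above is a one-liner in Python (the `inner_product` name is illustrative):

```python
def inner_product(d, q):
    """Dot product of two term-weight vectors of equal length."""
    return sum(wd * wq for wd, wq in zip(d, q))

D1 = [2, 3, 5]   # 2T1 + 3T2 + 5T3
D2 = [3, 7, 1]   # 3T1 + 7T2 + 1T3
Q  = [0, 0, 2]   # 0T1 + 0T2 + 2T3

print(inner_product(D1, Q))  # 10
print(inner_product(D2, Q))  # 2
```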
  42. Vector space model: why raw distance is not a good measure of vector similarity
      The Euclidean distance between q and d2 is large even though the distribution of terms in the query q and the distribution of terms in the document d2 are very similar.
  43. Vector space model: cosine measure
      • Compute the weights for D and Q
      • Normalize the lengths of the vectors D and Q
      • Compute the cosine similarity
  44. Vector space model: cosine measure
      Length normalization:
      • A vector can be length-normalized by dividing each of its components by its length
      • Long and short documents then have comparable weights
  45. Vector space model: cosine measure [figure]
  46. Vector space model: cosine measure
      Cosine similarity (dot product of unit vectors):
        cos(q, d) = (q · d) / (|q| |d|)
                  = Σ over i = 1..|V| of qi·di, divided by sqrt(Σ qi²) × sqrt(Σ di²)
      • qi is the tf-idf weight of term i in the query
      • di is the tf-idf weight of term i in the document
      • cos(q, d) is the cosine similarity of q and d, or, equivalently, the cosine of the angle between q and d.
  47. Vector space model: cosine for length-normalized vectors
      For length-normalized vectors q and d, cosine similarity is simply the dot (scalar) product:
        cos(q, d) = q · d = Σ over i = 1..|V| of qi·di
  48. Example: cosine similarity among 3 documents
      How similar are the novels
      SaS: Sense and Sensibility
      PaP: Pride and Prejudice
      WH: Wuthering Heights?

      Term frequencies (counts):
      term        SaS   PaP   WH
      affection   115    58   20
      jealous      10     7   11
      gossip        2     0    6
      wuthering     0     0   38

      Note: to simplify this example, we don't do idf weighting.
  49. 3 documents example, contd.
      Log-frequency weighting:
      term        SaS    PaP    WH
      affection   3.06   2.76   2.30
      jealous     2.00   1.85   2.04
      gossip      1.30   0      1.78
      wuthering   0      0      2.58

      After length normalization:
      term        SaS    PaP    WH
      affection   0.789  0.832  0.524
      jealous     0.515  0.555  0.465
      gossip      0.335  0      0.405
      wuthering   0      0      0.588

      cos(SaS,PaP) ≈ 0.789 × 0.832 + 0.515 × 0.555 + 0.335 × 0.0 + 0.0 × 0.0 ≈ 0.94
      cos(SaS,WH)  ≈ 0.79
      cos(PaP,WH)  ≈ 0.69
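The whole pipeline on these two slides (raw counts → log-tf weighting → length normalization → dot product) can be reproduced in a short Python sketch; the `cosine` and `log_tf` helper names are illustrative:

```python
import math

def log_tf(v):
    """Log-frequency weighting applied componentwise."""
    return [1 + math.log10(x) if x > 0 else 0.0 for x in v]

def cosine(u, v):
    """Cosine similarity: dot product over the product of vector lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Term counts for (affection, jealous, gossip, wuthering) from the slide.
SaS = log_tf([115, 10, 2, 0])
PaP = log_tf([58, 7, 0, 0])
WH  = log_tf([20, 11, 6, 38])

print(round(cosine(SaS, PaP), 2))  # 0.94
print(round(cosine(SaS, WH), 2))   # 0.79
print(round(cosine(PaP, WH), 2))   # 0.69
```

Normalizing first and then taking the plain dot product, as the slides do, gives the same numbers, since the `cosine` helper divides by the vector lengths itself.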
  50. Evaluation
  51. Precision and Recall
      The entire document collection splits into retrieved vs. not retrieved and relevant vs. irrelevant:
      • retrieved & relevant (TP)
      • retrieved & irrelevant (FP)
      • not retrieved but relevant (FN)
      • not retrieved & irrelevant (TN)

      Measure     Formula
      Precision   TP / (TP + FP)
      Recall      TP / (TP + FN)
  52. Precision and Recall
      • Precision: the ability to retrieve top-ranked documents that are mostly relevant.
      • Recall: the ability of the search to find all of the relevant items in the corpus.
  53. Precision and Recall: Example
      Assume the following:
      • A database contains 80 records on a particular topic.
      • A search was conducted on that topic and 60 records were retrieved.
      • Of the 60 records retrieved, 45 were relevant.
      Calculate the precision and recall scores for the search.
      Using the designations above:
      • A = the number of relevant records retrieved = 45
      • B = the number of relevant records not retrieved = 35 (80 - 45)
      • C = the number of irrelevant records retrieved = 15 (60 - 45)
      Recall    = (45 / (45 + 35)) × 100% = 45/80 × 100% = 56%
      Precision = (45 / (45 + 15)) × 100% = 45/60 × 100% = 75%
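The arithmetic in this example maps directly onto the TP/FP/FN formulas from slide 51; a quick Python check (the function names are illustrative):

```python
def precision(tp, fp):
    """Fraction of retrieved records that are relevant: TP / (TP + FP)."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Fraction of relevant records that were retrieved: TP / (TP + FN)."""
    return tp / (tp + fn)

# Slide example: 80 relevant records exist, 60 retrieved, 45 of them relevant.
tp = 45          # relevant and retrieved (A)
fp = 60 - 45     # retrieved but irrelevant = 15 (C)
fn = 80 - 45     # relevant but not retrieved = 35 (B)

print(f"precision = {precision(tp, fp):.0%}")  # 75%
print(f"recall    = {recall(tp, fn):.0%}")     # 56%
```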
  54. Thank you! Questions?
  55. References
      • Manning, Christopher D., et al. Introduction to Information Retrieval. 2008.
      • Baeza-Yates, Ricardo, et al. Modern Information Retrieval. 1999.
      • Adrian Paschke, "Web Based Information Systems", lecture notes, FU Berlin.
      • Rohit Kate, "Natural Language Processing", lecture notes, University of Wisconsin-Milwaukee.
