Semantics hidden within co-occurrence patterns

Slides of the IEEE CS Society talk delivered at Yahoo India, Nov 20 2009.

Transcript

  • 1. Co-occurrence and Meaning Co-occurrence graphs Interpretation of Co-citations Topical Anchors References Semantics hidden within co-occurrence patterns A bottom-up approach to the Semantic Web? Srinath Srinivasa IIIT Bangalore sri@iiitb.ac.in IEEE Computer Society talk. Nov 20 2009. © Srinath Srinivasa, IIIT-Bangalore
  • 2. Outline
    1 Co-occurrence and Meaning
    2 Co-occurrence graphs
    3 Interpretation of Co-citations
    4 Topical Anchors
  • 4. Conventional WebIR and co-occurrence
    Lexical feature extraction: bag-of-words model
    Document vectorization
    Implicit assumption of independence of dimensions
    Vector space reduction and spectral analyses for identifying hidden semantics (e.g., LSA, SVD, clustering)
    In human languages, lexical terms are not independent of one another; moreover, important semantic structures are inherent in the way terms co-occur.
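A minimal sketch of the bag-of-words view, assuming a toy whitespace tokenizer and a two-sentence corpus (both hypothetical, not from the talk). Each distinct term becomes its own dimension, so word order and any relatedness between terms are not represented:

```python
from collections import Counter

def bag_of_words(text):
    """Lowercase, split on whitespace, and count terms (toy tokenizer)."""
    return Counter(text.lower().split())

def to_vector(bag, vocabulary):
    """Project a bag onto a fixed vocabulary order: one dimension per term."""
    return [bag[term] for term in vocabulary]

# Hypothetical documents that differ only in word order.
docs = ["the plane landed on the river", "the river landed on the plane"]
bags = [bag_of_words(d) for d in docs]
vocabulary = sorted(set().union(*bags))
vectors = [to_vector(b, vocabulary) for b in bags]
```

Both documents map to the same vector here, which illustrates how much structure the representation discards.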
  • 5. Motivational Problems
    Some motivating problems that show the limitations of purely lexical approaches to IR:
    The topical anchor problem
    “If ever a player has overshadowed Sachin Tendulkar for sheer class of batsmanship, it is V V S Laxman. After a record 353-run fourth-wicket partnership in the 2004 Sydney Test when Laxman hit 30 fours in his 178 to Tendulkar’s 33 in his unbeaten 241, the master put the artistry of V V S in perspective.”
    What is the best topic for this paragraph: Sachin Tendulkar, V V S Laxman, Sydney, Australia, Cricket, Test Match?
  • 6. Motivational Problems
    The semantic attributes problem
    Given that a user has searched for the term “Malmö”, which of the following keywords can be termed “attributes” that enhance the meaning represented by Malmö:
    Driving, History, Mileage, Weather, Symptoms, Elephant, LaTeX beamer, Infringement
  • 7. Motivational Problems
    The topical marker problem
    The US Federal Aviation Regulations Sec 380.12 states that:
    The charter operator may not cancel a charter for any reason (including insufficient participation), except for circumstances that make it physically impossible to perform the charter trip, less than 10 days before the scheduled date of departure of the outbound trip. If the charter operator cancels 10 or more days before the scheduled date of departure, the operator must so notify each participant in writing within 7 days after the cancellation but in any event not less than 10 days before the scheduled departure date of the outbound trip. If a charter is canceled less than 10 days before scheduled departure (i.e., for circumstances that make it physically impossible to perform the charter trip), the operator must get the message to each participant as soon as possible.
    If a user who has booked a ticket with a charter operator finds that her flight has been cancelled suddenly without notice and wants to confront the operator, what should she search for: charter operator, FAR, cancellation, scheduled trip, Sec 380, operator, notification, ...?
  • 9. Motivational Problems
    The theme problem:
    Article 1: A US Airways Airbus A320 suffered a bird hit immediately after take-off from New York’s La Guardia airport and was forced to land on the Hudson river. All 150 passengers and 5 crew members are reported to be safe with only minor injuries.
    Article 2: La Guardia International Airport is conveniently located to serve the citizens of both New York and New Jersey. Airlines that operate from here include: US Airways, Delta, Continental and Virgin. Its MRO routinely serves a number of aircraft types including Boeing 73x series and the Airbus A320 and A330 series. La Guardia is easily reachable from New Jersey through the Lincoln tunnel that runs under the Hudson river.
    Article 3: Pilot Steve Bolle of a light aircraft in northern Australia was forced to make an emergency landing in water after suffering engine trouble on take-off. He landed the Piper Chieftain plane in shallow waters after realising he would not make it back to the airport. Mr Bolle and his five passengers were able to wade safely to shore.
    Which of the articles above are similar to one another? (Ack to sources: Wikipedia and BBC)
  • 13. Co-occurrence and Meaning
    Hebbian learning
    Co-occurrence plays a central role in the Hebbian theory of the semantic organization of the human mind, which states that synaptic plasticity between neurons is determined by repeated and persistent stimulation of the pre- and post-synaptic cells [2]. This is also summarized as: Cells that fire together, wire together
    Co-occurrence and the language instinct
    Language structures such as pluralization are often learnt by analyzing co-occurrence patterns. An interesting example is the “wug” test (cf. [5]):
    That is a pig; these are pigs. That is a dog; these are dogs. That is a cat; these are cats. That is a wug; these are ____.
    The use of co-occurrence is even more apparent in this example, which leads to confusion (even if only for a moment):
    The plural of radius is radii; the plural of thesis is theses; the plural of bus is buses. The plural of lotus is lotii? lotes? lotuses?
  • 15. Co-occurrence and meaning
    Meaning is usage
    The analytic philosophy worldview that meaning is usage [1] can be explained by representing usage as co-occurrence analysis. Consider the following paragraphs:
    Every day, I go to work in my pqer. My pqer runs on diesel and gives one of the best mileages for pqers in its category. My pqer can seat five people and is a good candidate for pqer-pooling.
    On December 26 2004, a massive earthquake measuring 9.1 jolted Java. This earthquake triggered a huge tsunami that has been the deadliest in history. We have developed an applet to simulate the path taken by the tsunami. You can run this applet in any browser that has Java enabled.
    In the first paragraph, the meaning of the word “pqer”, and in the second paragraph, the word sense of each occurrence of the term “Java”, are both resolved by looking at the other terms that co-occur with them.
  • 19. Outline
    1 Co-occurrence and Meaning
    2 Co-occurrence graphs
    3 Interpretation of Co-citations
    4 Topical Anchors
  • 20. Capturing co-occurrence
    We are given a document corpus that is represented as a set of “contexts”: $C = \{C_1, C_2, \ldots, C_n\}$
    Depending on the specific problem, a context may take various forms: sentence, paragraph, document, etc.
    Two entities $e_i$ and $e_j$ are said to co-occur (denoted $e_i \sim e_j$) if there is some context $C_k \in C$ such that $e_i, e_j \in C_k$
    The support for a co-occurring pair $e_i \sim e_j$ is the probability of finding this co-occurrence in any given context in the corpus. In other words, the support is the joint probability $P(e_i, e_j)$
    Note that co-occurrence is an n-ary relation. For simplicity, we focus on pairwise co-occurrences and derive higher-order semantics when required.
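The support definition above can be sketched as follows, assuming each context is already reduced to a set of entity strings and that the probability is estimated as the fraction of contexts containing both entities. The function name `pairwise_support` and the toy contexts are hypothetical:

```python
from collections import Counter
from itertools import combinations

def pairwise_support(contexts):
    """Map each unordered entity pair to the fraction of contexts containing both."""
    counts = Counter()
    for context in contexts:
        # Count every unordered pair once per context.
        for a, b in combinations(sorted(set(context)), 2):
            counts[frozenset((a, b))] += 1
    n = len(contexts)
    return {pair: c / n for pair, c in counts.items()}

# Hypothetical contexts (for example, one per paragraph).
contexts = [
    {"tendulkar", "laxman", "cricket"},
    {"tendulkar", "cricket"},
    {"laxman", "sydney"},
    {"java", "applet"},
]
support = pairwise_support(contexts)
```

Here `support[frozenset({"tendulkar", "cricket"})]` is 2/4, since the pair appears together in two of the four contexts; pairs that never share a context are simply absent from the result.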
  • 23. Co-occurrence graphs
    Co-occurrence graph
    A co-occurrence graph is a weighted, undirected graph $G = (E, \sim, w)$, where $E$ is a set of “entities”, ${\sim} \subseteq E \times E$ is a set of co-occurrences, and $w : {\sim} \to \mathbb{R}$ indicates the support for each co-occurrence
    Co-occurrence versus n-partite graphs
    Semantic co-occurrence graphs
    A semantic co-occurrence graph is a co-occurrence graph that is augmented with a concept hierarchy. A concept hierarchy is defined by one or more partial orders of the form ${\preceq} \subseteq E \times E$, representing relationships like is-a and is-in, that are reflexive, anti-symmetric and transitive.
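A small sketch of these two definitions, assuming the graph is stored as a symmetric adjacency map and a concept hierarchy as a set of ordered pairs. The helper names and the toy is-in relation are illustrative assumptions, not the author's data structures:

```python
def build_graph(support):
    """Turn {(a, b): weight} into a symmetric adjacency map {a: {b: weight}}."""
    graph = {}
    for (a, b), w in support.items():
        graph.setdefault(a, {})[b] = w
        graph.setdefault(b, {})[a] = w  # undirected: store both directions
    return graph

def is_partial_order(entities, relation):
    """Check that a set of ordered pairs is reflexive, anti-symmetric and transitive."""
    reflexive = all((e, e) in relation for e in entities)
    antisymmetric = all(a == b or (b, a) not in relation for (a, b) in relation)
    transitive = all((a, d) in relation
                     for (a, b) in relation
                     for (c, d) in relation if b == c)
    return reflexive and antisymmetric and transitive

# Hypothetical support values and a hypothetical is-in hierarchy.
support = {("sydney", "australia"): 0.4, ("sydney", "cricket"): 0.1}
graph = build_graph(support)

entities = {"sydney", "new south wales", "australia"}
is_in = {(e, e) for e in entities} | {
    ("sydney", "new south wales"),
    ("new south wales", "australia"),
    ("sydney", "australia"),
}
valid = is_partial_order(entities, is_in)
```

Dropping the pair `("sydney", "australia")` from `is_in` breaks transitivity, so the check would then fail.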
  • 26. Co-occurrence graph
    Example: Concept hierarchy construction
    1 Start with a base ontology
    2 Use co-occurrence patterns to guess conceptual relationships across terms
    3 Use the concept hierarchy to identify deeper co-occurrence patterns
    4 Repeat from step 2 in a semi-automated fashion until the algorithm stabilizes
  • 28. Co-occurrence graphs
    Characteristics of co-occurrence graphs
    Triadic closure (highly clustered)
    Disconnected components, or a single component of very small diameter
    The co-occurrence graph of all noun phrases in Wikipedia has a diameter of 4
    Co-occurrence support for entity pairs follows a power law
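Two of these characteristics can be measured directly. The sketch below assumes an unweighted adjacency map and a connected graph, and uses the standard local clustering coefficient as the measure of triadic closure; the toy graph is hypothetical:

```python
from collections import deque
from itertools import combinations

def clustering_coefficient(graph, node):
    """Fraction of a node's neighbour pairs that are themselves linked."""
    neighbours = graph[node]
    k = len(neighbours)
    if k < 2:
        return 0.0
    links = sum(1 for a, b in combinations(neighbours, 2) if b in graph[a])
    return 2.0 * links / (k * (k - 1))

def eccentricity(graph, source):
    """Largest BFS distance from source to any reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return max(dist.values())

def diameter(graph):
    """Longest shortest path; assumes the graph is connected."""
    return max(eccentricity(graph, node) for node in graph)

# Hypothetical graph: a triangle a-b-c with a pendant node d attached to c.
graph = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
```

Running BFS from every node is quadratic in the worst case, so this is only practical for small graphs or individual components.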
  • 29. Outline
    1 Co-occurrence and Meaning
    2 Co-occurrence graphs
    3 Interpretation of Co-citations
    4 Topical Anchors
  • 30. Co-citation
    Co-citation and bibliographic coupling are important metrics in several datasets such as scientific literature, web pages, wikis, and tagging systems like delicious
    Co-citation of a pair of documents corresponds to the co-occurrence of their references (e.g., URLs) in a context
    Pair-wise co-citation graphs have the same properties as co-occurrence graphs
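Treating each citing document as a context, co-citation counts can be sketched as below. The input format (a map from citing document to the documents it cites) and the page names are assumptions made for illustration:

```python
from collections import Counter
from itertools import combinations

def cocitation_counts(outlinks):
    """Count, for each unordered pair of documents, how many documents cite both."""
    counts = Counter()
    for cited in outlinks.values():
        # Each citing document is one context; its references co-occur in it.
        for a, b in combinations(sorted(set(cited)), 2):
            counts[frozenset((a, b))] += 1
    return counts

# Hypothetical citation data.
outlinks = {
    "p1": ["A", "B", "C"],
    "p2": ["A", "B"],
    "p3": ["B", "C"],
    "A": ["B"],
}
counts = cocitation_counts(outlinks)
```

In this toy data `A` and `B` are co-cited twice (by `p1` and `p2`), while `A` and `C` are co-cited once.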
  • 31. Co-citation Patterns
    Hyperlink distance across pairs of highly co-cited pages [8]
    Figure: Hyperlink distance across pairs of highly co-cited Web pages (axes: F versus k, with k = 1, ..., 7, kmax, >kmax)
    Figure: Hyperlink distance across pairs of highly co-cited Wikipedia pages (axes: F versus k, with k = 1, ..., 7, kmax, >kmax)
  • 32. Co-citation Patterns
    Hyperlink distance across pairs of highly co-cited pages
    Endorsement of a citation
    Page A endorses the content of page B
    Users reading page A traverse this link and find page B useful too
    Users create their own pages citing both A and B
    If A has several outgoing links, and only some pairs of outlinks are co-cited, then co-citation can be seen as an endorsement of the citation
    Topical aggregation
    Document A represents content about a “higher-level” topic in terms of is-a or is-in relationships, and links to (hence co-cites) several pages on “lower-level” topics
    Pages on the “lower-level” topics usually cite back the page on the “higher-level” topic, hence giving a citation distance of 2 among themselves
    Nepotistic co-citations
    Another major source of co-citation (primarily on web pages) is “nepotistic links” in the form of navigational tabs
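The distance-2 pattern described under topical aggregation can be reproduced with a breadth-first search. This sketch assumes hyperlink distance means the length of the shortest directed link path; the page names are hypothetical:

```python
from collections import deque

def hyperlink_distance(links, source, target):
    """Length of the shortest directed link path, or None if target is unreachable."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        page = queue.popleft()
        if page == target:
            return dist[page]
        for nxt in links.get(page, ()):
            if nxt not in dist:
                dist[nxt] = dist[page] + 1
                queue.append(nxt)
    return None

# Hypothetical topical aggregation: "hub" covers the higher-level topic,
# "b" and "c" cover lower-level topics and link back to the hub.
links = {"hub": ["b", "c"], "b": ["hub"], "c": ["hub"]}
distance_b_to_c = hyperlink_distance(links, "b", "c")
```

The two lower-level pages are co-cited by the hub and reach each other through it, so their distance is 2.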
  • 35. Co-citation graph of a web crawl
    Pairs of pages with at least 100 non-nepotistic co-citations
  • 36. Co-citation graph of a web crawl
    The co-citation graph depicts pairs of pages with at least 100 non-nepotistic co-citations
    In addition to being made up of disconnected components, the graph also shows various recurring structural motifs such as:
    Star
    Clique
    Clique chain
    Dumb-bell
    Interpretations of these motifs, along with examples, are given in Mutalikdesai and Srinivasa (2009) [4]
  • 37. Endorsed hyperlink graph (EHG)
    On the web, a co-citation usually implies a citation. Hence the EHG is essentially a directed version of the co-citation graph.
    Some EHG components are depicted below:
    EHG clique chain
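One possible reading of the EHG, offered as an assumption rather than the author's construction: keep a directed link from one page to another only when the two pages are also co-cited by at least some threshold number of other pages. The function name, the `min_cocitations` parameter and the toy data are hypothetical:

```python
from collections import Counter
from itertools import combinations

def endorsed_hyperlinks(outlinks, min_cocitations=1):
    """Directed links (src, dst) whose endpoints are co-cited often enough."""
    counts = Counter()
    for cited in outlinks.values():
        for a, b in combinations(sorted(set(cited)), 2):
            counts[frozenset((a, b))] += 1
    return {
        (src, dst)
        for src, cited in outlinks.items()
        for dst in set(cited)
        if src != dst and counts[frozenset((src, dst))] >= min_cocitations
    }

# Hypothetical data: A links to B and C; user pages u1..u3 cite A with B or C.
outlinks = {
    "A": ["B", "C"],
    "u1": ["A", "B"],
    "u2": ["A", "B"],
    "u3": ["A", "C"],
}
ehg = endorsed_hyperlinks(outlinks, min_cocitations=2)
```

With a threshold of 2, only the link from `A` to `B` survives, because `A` and `C` are co-cited just once.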
  • 38. Endorsed citation graph (ECG) for scientific literature
    ECG of citation information obtained from CiteSeer
  • 39. Endorsed citation graph
    The ECG over scientific literature data (using CiteSeer) shows a similar componentization of the graph, except that the ECG has one giant component
    Citation in scientific literature has some subtle differences from hyperlink citation
    Scientific literature citations always point into the past
    Scientific literature citations very rarely (if at all) form cyclic structures
    The ECG consists mostly of weakly connected directed components, while the EHG may contain strongly connected components
  • 40. ERank
    Importance of a page within an EHG
    ERank is an authority score of a page within an EHG (ECG) component
    It depicts the reachability of the page within the component
    ERank scores in a component are shown to be uncorrelated with the PageRank scores of the pages in that component
  • 41. EndorSeer: a Firefox plugin for augmented browsing of CiteSeer. It currently shows the endorsed citations from among the list of citations of any paper. Currently underway: showing the ECG component and ECG neighbourhood of a paper.
  • 42. Outline 1 Co-occurrence and Meaning 2 Co-occurrence graphs 3 Interpretation of Co-citations 4 Topical Anchors
  • 43. Topical Anchors [6, 7]: Motivation. Example: “Will my oral insulin drugs, along with my hypertension and high blood glucose, have any side effects on the health of my pancreas?” Can a machine detect diabetes as the context? Another example: a document containing the words Andy Roddick, Roger Federer and Rafael Nadal. How likely is it that the word Tennis will be mentioned (semantically) when discussing these players?
  • 45. Co-occurrence context. Given a set of query terms, the co-occurrence context is defined as the subgraph formed by the query terms and the set of terms that co-occur with at least one of the query terms. Conjecture: the topical anchor of a set of terms is a highly authoritative term that lies within the co-occurrence context of the query terms.
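The definition of the co-occurrence context can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the graph is built by counting document-level co-occurrences, the context is taken to be the induced subgraph (edges among neighbours are kept), and the names `build_cooccurrence_graph` and `cooccurrence_context` are hypothetical, not taken from the talk.

```python
from collections import defaultdict


def build_cooccurrence_graph(documents):
    """Return graph[a][b] = number of documents in which terms a and b co-occur.

    Assumption: co-occurrence is counted once per document, ignoring repeats.
    """
    graph = defaultdict(lambda: defaultdict(int))
    for terms in documents:
        unique = sorted(set(terms))
        for i, a in enumerate(unique):
            for b in unique[i + 1:]:
                graph[a][b] += 1
                graph[b][a] += 1
    return graph


def cooccurrence_context(graph, query_terms):
    """Subgraph over the query terms plus every term co-occurring with one of them."""
    nodes = set(query_terms)
    for q in query_terms:
        nodes.update(graph.get(q, {}))
    # Keep only edges whose two endpoints both lie inside the context.
    return {
        a: {b: w for b, w in graph.get(a, {}).items() if b in nodes}
        for a in nodes
    }
```

For a query such as `["federer"]`, the context contains the query term and its direct neighbours only; terms that never co-occur with it are left out.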
  • 46. Online Page Importance Computation (OPIC). Each node $i$ in the context is initialised with a cash $c_i$. A node $a$ is picked at random and its cash $c_a$ is added to its history $h_a$. Then $c_a$ is distributed amongst all its neighbours in proportion to the edge weights. This process is iterated until the ratios of the $h_i$ become nearly constant. The node with the largest $h_i$ is chosen as the most central node. Unfortunately, OPIC was seen to be unsuitable for determining topical anchors, since it tends to find nodes that are central to the entire graph.
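The OPIC steps above can be sketched as follows. This is a minimal illustration, not the implementation used in the talk: the function name `opic`, the unit initial cash, the fixed step count (used in place of the "ratios become nearly constant" stopping test) and the seeded random generator are all assumptions made for the sketch.

```python
import random


def opic(graph, steps=2000, seed=0):
    """Return the cash history of each node after a number of random picks.

    graph: dict mapping node -> {neighbour: edge weight}, assumed symmetric.
    """
    rng = random.Random(seed)
    nodes = sorted(graph)
    cash = {n: 1.0 for n in nodes}      # assumption: every node starts with one unit
    history = {n: 0.0 for n in nodes}
    for _ in range(steps):
        a = rng.choice(nodes)
        amount = cash[a]
        total_weight = sum(graph[a].values())
        if amount == 0.0 or total_weight == 0:
            continue                    # nothing to hand out, or an isolated node
        history[a] += amount            # record the cash held when picked
        cash[a] = 0.0
        for b, w in graph[a].items():   # share it out in proportion to edge weight
            cash[b] += amount * w / total_weight
    return history
```

On a star-shaped graph the hub accumulates the largest history, since every unit of cash held by a leaf has to pass through it.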
  • 47. Cash Leaking Random Walk (CLRW). Co-occurrence graphs have extremely small diameters (4-5). For example, Roger Federer reaches feral child in two hops. As a result, Football rather than Tennis becomes the most central term for Roger Federer and Rafael Nadal. Solution: cash leakage.
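The slide names cash leakage as the remedy without giving details, so the following is only one possible reading, written as a deterministic, round-based diffusion rather than a true random walk. Everything specific here is an assumption for illustration: the function name `leaky_cash_walk`, the `leak` fraction lost at every hop, and the choice to inject cash only at the query terms.

```python
def leaky_cash_walk(graph, query_terms, leak=0.5, rounds=10):
    """Spread cash from the query terms, losing a fraction `leak` at each hop.

    graph: dict mapping node -> {neighbour: edge weight}, assumed symmetric.
    Returns the accumulated cash history of every node.
    """
    cash = {n: 0.0 for n in graph}
    for q in query_terms:
        cash[q] = 1.0                       # assumption: cash starts at query terms only
    history = {n: 0.0 for n in graph}
    for _ in range(rounds):
        next_cash = {n: 0.0 for n in graph}
        for a, amount in cash.items():
            if amount == 0.0:
                continue
            history[a] += amount
            total_weight = sum(graph[a].values())
            if total_weight == 0:
                continue
            kept = amount * (1.0 - leak)    # the rest leaks out of the system
            for b, w in graph[a].items():
                next_cash[b] += kept * w / total_weight
        cash = next_cash
    return history
```

Because the cash shrinks geometrically with every hop, nodes close to all query terms collect far more history than globally popular nodes that are merely reachable in a few hops.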
  • 48. Bias and History Vectors. The way centrality is computed introduces a hidden bias between query words. Example: Jim Carrey, Hugh Grant, Rajkumar. Bias due to difference in neighbourhood sizes. Bias due to polysemy. Example: Java, Beans, Kaffe.
  • 49. Bias examples. Table: Examples with irrelevant topical anchors
      Query Terms | Topical Anchors
      Java, Beans, Kaffe | Programming language, Indonesia, Food
      United States Dollar, Euro, West African CFA franc | French language, Guinea, Guinea-Bissau
      Bayes, Euclid, Ramanujan, Bernoulli | Probability, Mathematics, Number
      MIT, Stanford, IIT | University, Indian Institute of Technology, Bombay
      Leaf, Fruit, Stem, Photosynthesis | Linguistics, Plant, Tree
      Bernoulli, Poisson, Weibull, Binomial | Godwin, Norway, Harold Godwinson
  • 50. Solution to the topic bias problem: labelled cash (vector models of CLRW). Cash from each query term $q_i$ is given a “colour” $c_i$. The cash history at any node is hence a vector of the form $(v_1, v_2, \ldots, v_n)$ showing the cash flow history for each of the colours. The vector is then normalized as $\hat{v}_i = v_i / v$, where $v = \max_i v_i$ and $\hat{v}_i \in [0, 1]$.
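The normalization step can be written directly from the formula. The sketch below assumes the maximum is taken over the components of a single node's history vector; the function name `normalise_history` and the all-zero fallback are illustrative choices, not part of the talk.

```python
def normalise_history(vector):
    """Divide every colour's cash history by the largest component."""
    peak = max(vector)
    if peak == 0:
        # Assumption: a node that received no cash maps to the zero vector.
        return [0.0 for _ in vector]
    return [v / peak for v in vector]
```

A node that received cash of 2, 4 and 1 from three query terms is mapped to `[0.5, 1.0, 0.25]`, so its largest component is always 1 and the others show how evenly the query terms contributed.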
  • 51. Projection. The line joining $\mathbf{0}_n$ to $\mathbf{1}_n$ (the all-zeros and all-ones vectors in $n$ dimensions) represents points where all query terms have contributed equally to the cash history. This line is called the baseline. Hence, for any given node, the projection of its cash history vector onto the baseline represents the importance of the node as a topical anchor.
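As a small worked sketch, the scalar projection of a cash history vector onto the baseline is its dot product with the unit vector along the all-ones direction, that is, the sum of the components divided by $\sqrt{n}$. The function name `baseline_projection` is hypothetical, and whether the raw or the normalized vector is passed in is left to the caller.

```python
import math


def baseline_projection(vector):
    """Length of the projection of `vector` onto the line through 0 and the all-ones vector."""
    n = len(vector)
    return sum(vector) / math.sqrt(n)
```

Under this measure a larger value ranks a node higher; a vector with one dominant colour can still score well if that component is large, which is consistent with the projection results remaining biased in the examples.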
  • 52. Euclidean distance. The Euclidean metric computes the L2 distance between the normalized cash history vector of a candidate node and $\mathbf{1}_n$. It favours uniformity in the cash history distribution over the overall magnitude of the cash history.
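The Euclidean measure can be sketched as the L2 distance from a normalized history vector to the all-ones vector. The function name `distance_to_all_ones` is an assumption, as is the reading that a smaller distance means a better topical anchor candidate.

```python
import math


def distance_to_all_ones(vector):
    """L2 distance between a normalized history vector and the all-ones vector."""
    return math.sqrt(sum((1.0 - v) ** 2 for v in vector))
```

A node to which every query term contributed equally has distance 0, while a node fed by only one of two query terms has distance 1.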
  • 53. Cosine similarity. Computes the cosine between a given node’s normalized cash history vector and $\mathbf{1}_n$. Another metric that factors in both uniformity in the cash distribution and magnitude.
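The cosine measure can be sketched in the same style. The dot product with the all-ones vector is simply the sum of the components, and the norm of the all-ones vector is $\sqrt{n}$. The function name `cosine_to_all_ones` and the zero-vector fallback are assumptions for this sketch.

```python
import math


def cosine_to_all_ones(vector):
    """Cosine of the angle between `vector` and the all-ones vector of the same length."""
    norm = math.sqrt(sum(v * v for v in vector))
    if norm == 0:
        return 0.0  # assumption: a node with no cash history scores zero
    return sum(vector) / (math.sqrt(len(vector)) * norm)
```

The value is 1 when all colours contributed equally and falls as the history becomes lopsided. The cosine itself does not change if the vector is rescaled, so in this sketch any sensitivity to magnitude would have to come from how the history vector is prepared beforehand.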
  • 54. Example results
      Query Terms | Projection | Euclidean | Cosine
      United States Dollar, Euro, West African CFA franc | French language, Guinea, Guinea-Bissau | Currency, Bank, France | Currency, Bank, France
      Bayes, Euclid, Ramanujan, Bernoulli | Probability, Mathematics, Number | Mathematics, Mathematician, Euler | Mathematics, Mathematician, Probability distribution
      MIT, Stanford, IIT | University, Indian Institute of Technology, Bombay | University, College, Technology | University, College, Science
      Leaf, Fruit, Stem, Photosynthesis | Linguistics, Plant, Tree | Plant, Tree, Species | Plant, Tree, Species
      Bernoulli, Poisson, Weibull, Binomial | Godwin, Norway, Harold Godwinson | Mathematics, Probability, Expected Value | Mathematics, Probability, Statistics
  • 55. User evaluation. Experimental setup: 86 volunteer users were given a set of queries and asked to provide topical labels for these queries, ranked according to their perceived importance. 66 volunteers answered 100 questions, while the rest answered 30 questions chosen at random from the 100. User responses were charted for consistency in results. [Chart: consistency of user responses]
  • 56. User evaluation. [Chart: CLRW against tf-idf and OPIC]
  • 57. Comparison with the Automatic Topic Labeling (ATL) algorithm [3]. Caveats: the comparison is with the Euclidean algorithm. ATL requires document contexts in which the topical anchor is present (unlike CLRW, which searches the co-occurrence graph built over a corpus).
  • 58. Future Work. Several open questions: topical markers and semantic siblings; co-occurrence semantics when coupled with concept hierarchies; automatic detection of semantic relations based on co-occurrence; automatic attribute identification. Thank You!
  • 59. References
      [1] A. Biletzki and A. Matar. Ludwig Wittgenstein (second revision). Stanford Encyclopedia of Philosophy, May 2009.
      [2] Gerstner and Kistler. Spiking Neuron Models: Single Neurons, Populations, Plasticity. Cambridge University Press, 2002.
      [3] Q. Mei, X. Shen, and C. Zhai. Automatic labeling of multinomial topic models. In KDD ’07: Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 490–499, New York, NY, USA, 2007. ACM.
      [4] M. R. Mutalikdesai and S. Srinivasa. Co-citations as endorsements of citations. Submitted for publication, 2009.
      [5] S. Pinker. The Language Instinct. Harper Perennial Modern Classics, 2007.
      [6] A. R. Rachakonda and S. Srinivasa. Finding the topical anchors of a context using lexical cooccurrence data. In Proceedings of ACM Conference on Information and Knowledge Management (CIKM), 2009.
      [7] A. R. Rachakonda and S. Srinivasa. Vector-based ranking techniques for identifying the topical anchors of a context. In Proceedings of the 15th International Conference on Management of Data (COMAD), 2009.
      [8] S. Reddy, S. Srinivasa, and M. R. Mutalikdesai. Measures of "ignorance" on the web. In Proceedings of the International Conference on Management of Data (COMAD), Dec 2006.