Anil timeline construction


Presentation in Information Retrieval Class.

Published in: Technology

  1. 1. Clustering and Exploring Search Results using Timeline Constructions (Omar Alonso, Michael Gertz, Ricardo Baeza-Yates), presented by Anil Kumar Attuluri, 10/17/2011
  2. 2. Outline
      • Motivation
      • Background
      • Methods and Prototype
      • Evaluation
      • Conclusion
      • Examples
  3. 3. Motivation
  4. 4. Temporal Information
  5. 5. Temporal Information
  6. 6. Survey results (using Amazon Mechanical Turk)
      • Q. Do you think current timelines for organizing and clustering search results (such as Google's timeline) are useful for some of your daily search activities?
      • 76% answered "yes"
      • Q. Do you use timelines to explore search results?
      • 71% answered "yes"
  7. 7. Use Cases
      • History - information about a place or a person during a period of time in the past. No good timeline exists for events during World War II.
      • Research - information about a topic, sorted from oldest to newest. No timeline of volcanic activity in the US is available.
      • Events - details of soccer World Cups listed on a timeline. Timeline-based lists are not available.
  8. 8. Background
  9. 9. Hit lists and Clustering
      • Hit lists
          • A hit list is the set of all documents retrieved for a search query.
          • The documents are sorted by rank.
      • Clustering
          • Clustering is the process in which search results (hit lists) are categorized into clusters based on cluster labels.
          • Useful for providing a better exploration interface to the end user.
  10. 10. TimeML
      • TimeML is a formal specification language for events and temporal expressions.
      • EVENT - A fresh flow of lava, gas and debris erupted there Saturday.
      • TIMEX3 - June 11, 1989, or the Summer of 2002.
      • SIGNAL - They will investigate the role of the US before, during and after the genocide.
      • LINKS - John drove to Boston. During his drive he ate a donut.
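To make the TIMEX3 tag concrete, here is a minimal sketch that reads TimeML-style markup with Python's standard XML parser. The fragment is hand-written for illustration: it combines the slide's EVENT and TIMEX3 examples into one sentence, and the `<s>` wrapper and the `eid`/`tid` values are assumptions, not output of a real tagger.

```python
import xml.etree.ElementTree as ET

# Hand-written TimeML-style fragment (illustrative only; ids are made up).
SAMPLE = (
    '<s>A fresh flow of lava, gas and debris '
    '<EVENT eid="e1">erupted</EVENT> there on '
    '<TIMEX3 tid="t1" type="DATE" value="1989-06-11">June 11, 1989</TIMEX3>.</s>'
)

def extract_timex3(xml_text):
    """Return (tid, normalized value, surface text) for every TIMEX3 element."""
    root = ET.fromstring(xml_text)
    return [(t.get("tid"), t.get("value"), t.text) for t in root.iter("TIMEX3")]

timexes = extract_timex3(SAMPLE)
```

The `value` attribute carries the normalized date, which is what a timeline needs; the surface text is kept only for display.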
  11. 11. Amazon Mechanical Turk (AMT)
      • Amazon's platform for Human Intelligence Tasks (HITs): tasks performed by humans that computers cannot yet complete.
      • Requesters place HITs; Workers perform HITs.
      • An Application Programming Interface lets Requesters submit their HITs and retrieve the results.
  12. 12. Methods and Prototype
  13. 13. Time Annotated Document Model
      • Time and Timelines
          • A chronon is the atomic time interval, here a single day. Ex: May 10, 2011.
          • A granule is a contiguous sequence of chronons. Ex: week, month, year.
          • Granule composition has a lattice structure.
          • Timelines = {T_d (day), T_w (week), T_m (month), T_y (year)}
          • Chronons have a precedence relationship.
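As a rough illustration of the day/week/month/year timelines, the sketch below maps a day-level chronon to the label of the granule that contains it. The function name `to_granule` and the label formats (ISO week numbering for weeks) are assumptions made for this example, not part of the presented model.

```python
from datetime import date

def to_granule(chronon, granularity):
    """Map a day-level chronon to the label of its containing granule."""
    if granularity == "day":
        return chronon.isoformat()
    if granularity == "week":
        # ISO week numbering is an assumption; the slides do not fix a convention.
        iso_year, iso_week, _ = chronon.isocalendar()
        return f"{iso_year}-W{iso_week:02d}"
    if granularity == "month":
        return f"{chronon.year}-{chronon.month:02d}"
    if granularity == "year":
        return str(chronon.year)
    raise ValueError(f"unknown granularity: {granularity}")

chronon = date(2011, 5, 10)
labels = {g: to_granule(chronon, g) for g in ("day", "week", "month", "year")}
```

Weeks do not nest cleanly inside months, which is one reason granule composition forms a lattice rather than a simple chain.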
  14. 14. Time Annotated Document Model
      • Temporal Expressions
          • Document timestamp, collected during crawling.
          • Explicit temporal expressions. Ex: March 12, 2005.
          • Implicit temporal expressions. Ex: Columbus Day 2008.
          • Relative temporal expressions. Ex: two days from now.
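The three expression types differ in how much work is needed to reach a chronon. The sketch below resolves one example of each; the helper names are hypothetical, the explicit parser assumes an English locale and the exact "Month day year" form, and the implicit case hard-codes the US rule that Columbus Day is the second Monday of October.

```python
from datetime import date, datetime, timedelta

def resolve_explicit(text):
    """Explicit: the date is fully stated, e.g. 'March 12 2005'."""
    return datetime.strptime(text, "%B %d %Y").date()

def columbus_day(year):
    """Implicit: needs calendar knowledge (second Monday of October)."""
    first = date(year, 10, 1)
    days_to_monday = (0 - first.weekday()) % 7  # Monday is weekday 0
    return first + timedelta(days=days_to_monday + 7)

def resolve_relative(days_offset, reference):
    """Relative: needs an anchor such as the document timestamp."""
    return reference + timedelta(days=days_offset)

explicit = resolve_explicit("March 12 2005")
implicit = columbus_day(2008)
relative = resolve_relative(2, date(2011, 10, 17))  # "two days from now"
```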
  15. 15. Time Annotated Document Model
      • Temporal Document Profile
          • The temporal document profile is defined as tdp: D -> [E x C x P]*
              E = E_e U E_i U E_r (explicit, implicit and relative temporal expressions)
              C = set of all chronons
              P = set of all positions of a temporal expression in a document
          • Simply stated, tdp consists of tuples of the form (e_i, c_i, p_i).
          • The tuples in tdp are organized as follows:
              (explicit set, implicit set, dts, relative set)
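One possible in-memory form of a temporal document profile is sketched below. The class and field names are hypothetical, positions are taken to be character offsets (an assumption), and the document timestamp (dts) is given position 0 since it does not occur in the text.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class TemporalEntry:
    expression: str  # e_i: the temporal expression as written
    chronon: date    # c_i: the day it resolves to
    position: int    # p_i: character offset in the document (assumption)
    kind: str        # "explicit", "implicit", "dts" or "relative"

# Ordering taken from the slide: explicit set, implicit set, dts, relative set.
KIND_ORDER = {"explicit": 0, "implicit": 1, "dts": 2, "relative": 3}

def build_tdp(entries):
    """Order entries by kind, then by position within the document."""
    return sorted(entries, key=lambda e: (KIND_ORDER[e.kind], e.position))

tdp = build_tdp([
    TemporalEntry("two days from now", date(2008, 10, 15), 210, "relative"),
    TemporalEntry("crawl timestamp", date(2008, 10, 13), 0, "dts"),
    TemporalEntry("Columbus Day 2008", date(2008, 10, 13), 95, "implicit"),
    TemporalEntry("March 12 2005", date(2005, 3, 12), 40, "explicit"),
])
```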
  16. 16. Timeline Construction and Document Exploration
      • Constructing a Time Outline
          • Chronons are extracted from the hit list L_q.
          • The minimum and maximum chronons give the lower and upper bounds of the time outline.
          • Documents are organized within this temporal range, which forms the time outline.
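The bounds of a time outline can be sketched in a few lines. Here the hit list is modeled as a dictionary from a document id to the chronons found in that document; this representation and the name `time_outline` are assumptions for illustration.

```python
from datetime import date

def time_outline(hit_list):
    """Return (earliest, latest) chronon over all documents, or None if empty."""
    chronons = [c for doc_chronons in hit_list.values() for c in doc_chronons]
    if not chronons:
        return None
    return min(chronons), max(chronons)

hits = {
    "d1": [date(2006, 7, 9)],
    "d2": [date(1998, 7, 12), date(2010, 7, 11)],
    "d3": [],  # a document without temporal expressions
}
outline = time_outline(hits)
```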
  17. 17. Timeline Construction and Document Exploration
      • Document Clustering
          • Chronons are normalized to a granularity g, which can be day, week, month or year.
          • Documents are mapped to clusters.
          • The main cluster and hot spots are determined.
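A minimal sketch of the clustering step follows, assuming that each cluster corresponds to one granule and that a document joins every cluster in which one of its chronons falls. It is not the TCluster algorithm itself, whose details are not given on the slide; the function names and label formats are hypothetical.

```python
from collections import defaultdict
from datetime import date

def normalize(chronon, g):
    """Reduce a day-level chronon to a label at granularity g."""
    if g == "day":
        return chronon.isoformat()
    if g == "month":
        return f"{chronon.year}-{chronon.month:02d}"
    if g == "year":
        return str(chronon.year)
    raise ValueError(f"unsupported granularity: {g}")

def cluster_by_time(hit_list, g):
    """Map each granule label to the set of documents mentioning it."""
    clusters = defaultdict(set)
    for doc_id, chronons in hit_list.items():
        for c in chronons:
            clusters[normalize(c, g)].add(doc_id)
    return dict(sorted(clusters.items()))  # labels sort chronologically

hits = {
    "d1": [date(2006, 7, 9)],
    "d2": [date(2006, 6, 9), date(2010, 7, 11)],
    "d3": [date(2010, 6, 11)],
}
by_year = cluster_by_time(hits, "year")
```

Because a document can mention several dates, the clusters overlap: `d2` appears under both years in this example.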
  18. 18. Timeline Construction and Document Exploration
      • Ranking Documents in a Cluster
          • Ranks are determined as follows.
          • Given two documents d and d', d is ranked higher than d' if either of the following two conditions holds:
              1. rank(d, y_j) > rank(d', y_j)
              2. rank(d, y_j) = rank(d', y_j) and d is ranked higher in L_q than d'
              L_q - set of result documents of a query q
              y_j - cluster y_j
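The two ordering conditions amount to a sort with a tie-breaker. In the sketch below the per-cluster scores rank(d, y_j) are taken as given input, since the slide's scoring formula is not reproduced here; the function and parameter names are hypothetical.

```python
def rank_cluster(cluster_docs, cluster_rank, hit_list_order):
    """Order documents by rank(d, y_j) descending, ties broken by position in L_q."""
    position = {doc: i for i, doc in enumerate(hit_list_order)}
    return sorted(cluster_docs, key=lambda d: (-cluster_rank[d], position[d]))

# rank(d, y_j) for one cluster; the values are made up for the example.
scores = {"a": 2, "b": 3, "c": 2}
# L_q order, best first: c was ranked above a in the original hit list.
ordered = rank_cluster({"a", "b", "c"}, scores, ["c", "a", "b"])
```

Document `b` comes first on its cluster score; `c` precedes `a` only because of the tie-breaker on the original hit list.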
  19. 19. Timeline Construction and Document Exploration
      • Cluster Exploration
          • A cluster can be refined to a finer timeline granularity to explore the results within it.
              Ex: refine T_y into T_m or T_w
      • Temporal Snippets
          • Temporal snippets outline the main events in a document. They are created by pulling the most relevant sentences that contain temporal expressions, using the TSnippet algorithm.
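Refining a year cluster into months can be sketched as a filter followed by regrouping. The hit-list representation (document id mapped to chronons) and the function name are assumptions for illustration.

```python
from collections import defaultdict
from datetime import date

def refine_year_to_months(year, hit_list):
    """Split the cluster for `year` into month sub-clusters (T_y into T_m)."""
    months = defaultdict(set)
    for doc_id, chronons in hit_list.items():
        for c in chronons:
            if c.year == year:  # ignore chronons outside the cluster being refined
                months[f"{c.year}-{c.month:02d}"].add(doc_id)
    return dict(sorted(months.items()))

hits = {
    "d1": [date(2006, 6, 9), date(2006, 7, 9)],
    "d2": [date(2006, 7, 9), date(2010, 6, 11)],
}
refined = refine_year_to_months(2006, hits)
```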
  20. 20. PROTOTYPE
      • Document Annotation Pipeline
          • First, extract time-related metadata, such as the document timestamp, from the Web server at crawl time.
          • Second, run the POS tagger on each document; it tags parts of speech and inserts the sentence delimiters needed for temporal document annotation.
          • Third, run a temporal expression tagger based on the TimeML standard. It creates XML markup (called tdp), which is added to the document.
  21. 21. PROTOTYPE
      • Exploratory User Interface
  22. 22. Evaluation
  23. 23. Evaluation
      • Evaluation guidelines
          • Precision - fraction of retrieved documents that are relevant. All relevant documents must be included in the timeline.
          • Presentation - displaying the timeline in an intuitive graphical user interface.
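For reference, precision as defined above can be computed as follows. This is the standard set-based definition; the document ids are made up for the example.

```python
def precision(retrieved, relevant):
    """Fraction of retrieved documents that are relevant (0.0 if nothing retrieved)."""
    retrieved = set(retrieved)
    if not retrieved:
        return 0.0
    return len(retrieved & set(relevant)) / len(retrieved)

# Two of the four retrieved documents are relevant.
p = precision(["d1", "d2", "d3", "d4"], {"d1", "d3", "d9"})
```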
  24. 24. Evaluation
      • DMOZ
          • A multilingual open content directory. The World Cup category was picked for evaluation.
          • Results showed that the TCluster algorithm generated more clusters and was therefore more precise.
      • TimeBank
          • Contains news articles that have been annotated using TimeML.
          • Using the temporal expressions in documents gave a 50% increase in the number of clusters discovered by TCluster.
  25. 25. Evaluation
      • Relevance Evaluation using AMT
          • The goal was to evaluate the quality of search results using TCluster in combination with temporal snippets.
          • Ten random informational queries on Wikipedia featured articles were used. Average response was 4.04% (with an 80% agreement level).
          • The top ten most active topics on Twitter were used. Average response was 4.33% (with an 80% agreement level).
  26. 26. Conclusion
  27. 27. Conclusion
      • A framework for making search applications time-aware.
      • The TCluster algorithm provides flexibility, allowing users not only to explore the results over a timeline but also to explore them at multiple time granularities.
      • A user engaged in time-related investigations would benefit from this model when traditional information retrieval and search engines cannot offer much.
  28. 28. Examples
  29. 29. Google search based on time
  30. 30.
  31. 31.
  32. 32. LinkedIn timeline
  33. 33. Thank You!