

  1. A New Approach to News. Prof. Igor Trajkovski, Ph.D.
  2. Motivation: Traditionally, news readers first pick a publication and then look for headlines that interest them. Proprietary
  3. News Pipeline: Crawling and Extraction, Clustering, Story Classification, Scoring
  4. Crawling and Extraction
  5. Crawl • Most Macedonian news sites don’t have RSS feeds. • One-level crawl from a set of hubs: (Macedonia) (Economy) (Sport) (Culture) (None) • Many hubs per source. Some sources have fixed hub addresses (A1, Makfax, etc.); for others the hub addresses must be determined at runtime (Dnevnik, Utrinski, etc.) • Hubs are annotated with a section name (topic): Macedonia, Balkan, World, Economy, Culture, Fun, Sport, Chronic, None • At the moment, the hubs of each source are provided manually.
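
The one-level crawl described on this slide could be sketched roughly as follows. This is a minimal illustration, not the system's actual crawler: the `Hub` record, the `LinkCollector` parser, the `one_level_crawl` function and the example URL are all hypothetical names introduced here, and the hub page is passed in as an HTML string so that no network access is assumed.

```python
from dataclasses import dataclass
from html.parser import HTMLParser
from urllib.parse import urljoin


@dataclass(frozen=True)
class Hub:
    """Hypothetical hub record: one section page of one news source."""
    source: str   # e.g. "A1"
    url: str      # hub page address (placeholder in this sketch)
    section: str  # topic annotation, e.g. "Macedonia", "Sport", "None"


class LinkCollector(HTMLParser):
    """Collects (href, link text) pairs from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def one_level_crawl(hub, hub_html):
    """Return candidate articles linked directly from one hub page."""
    parser = LinkCollector()
    parser.feed(hub_html)
    seen = set()
    candidates = []
    for href, text in parser.links:
        url = urljoin(hub.url, href)  # resolve relative links against the hub
        if url in seen or not text:   # skip duplicates and links without text
            continue
        seen.add(url)
        candidates.append({"url": url, "link_text": text,
                           "source": hub.source, "section": hub.section})
    return candidates
```

Each candidate inherits the section tag of the hub it was found on, which is the information the later story classification step relies on.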
  6. Article Extraction: segment each page into title / body / image. Heuristics: • Title matches the link text and/or the HTML title, and is above the body • Body is a long run of unformatted Cyrillic text, below the title • Image is extracted from the hub page and has an attached link with exactly the same address as the article. The same procedure is used to extract all articles from all sources.
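
A rough sketch of the title and body heuristics, under several assumptions: the page has already been split into text blocks in document order, "mostly Cyrillic" is approximated by a ratio threshold (`min_ratio`, a made-up parameter), and the image heuristic is left out. The function names are illustrative only.

```python
import re

CYRILLIC = re.compile(r"[\u0400-\u04FF]")


def cyrillic_ratio(text):
    """Fraction of alphabetic characters that are Cyrillic."""
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return 0.0
    return sum(1 for c in letters if CYRILLIC.match(c)) / len(letters)


def extract_article(blocks, link_text, html_title, min_ratio=0.8):
    """Pick title and body from text blocks given in document order.

    Body: the longest block that is mostly Cyrillic.
    Title: the nearest block above the body that equals the link text
    or appears inside the HTML title.
    """
    body_idx = None
    for i, block in enumerate(blocks):
        if cyrillic_ratio(block) >= min_ratio:
            if body_idx is None or len(block) > len(blocks[body_idx]):
                body_idx = i
    if body_idx is None:
        return None
    for i in range(body_idx - 1, -1, -1):  # search upwards from the body
        candidate = blocks[i].strip()
        if candidate and (candidate == link_text.strip()
                          or candidate in html_title):
            return {"title": candidate, "body": blocks[body_idx].strip()}
    return None
```

Because the heuristics depend only on text shape and position, the same function can be applied to pages from any source, which matches the slide's point about a single extraction procedure.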
  7. Clustering
  8. Clustering: partition news articles into disjoint clusters such that: • articles within a cluster are very similar • articles in different clusters are very different.
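
The slides do not say which clustering algorithm is used. As one possible illustration of partitioning articles into disjoint clusters, here is a greedy single-pass scheme over sparse term-weight vectors with cosine similarity; the `threshold` value and all function names are assumptions made for this sketch.

```python
import math


def cosine(a, b):
    """Cosine similarity of two sparse vectors given as {term: weight}."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def centroid(vectors):
    """Component-wise mean of a non-empty list of sparse vectors."""
    total = {}
    for v in vectors:
        for t, w in v.items():
            total[t] = total.get(t, 0.0) + w
    return {t: w / len(vectors) for t, w in total.items()}


def cluster(vectors, threshold=0.5):
    """Assign each article to the most similar cluster, or start a new one.

    Returns a list of clusters, each a list of article indices; every
    index appears in exactly one cluster, so the result is a partition.
    """
    clusters = []
    for i, v in enumerate(vectors):
        best, best_sim = None, threshold
        for members in clusters:
            sim = cosine(v, centroid([vectors[j] for j in members]))
            if sim >= best_sim:
                best, best_sim = members, sim
        if best is None:
            clusters.append([i])
        else:
            best.append(i)
    return clusters
```

A higher threshold yields more, tighter clusters; a lower one merges loosely related stories.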
  9. Word weights: the weight is a function of word frequency within a document and across all documents. TF(w) = frequency of word w in a news article • Intuition: a word that appears more often in a text is more likely to be related to its “meaning”. IDF(w) = log(N / n_w) + 1 • where N = number of news articles and n_w = number of news articles containing w • Intuition: words that appear in many news articles are generally not very informative (e.g., “и” (and), “е” (is), “на” (on/of), “не” (not), “со” (with), “во” (in), “нели” (isn't it), “затоа” (therefore), etc.). TF-IDF: the weight of a word in a news article is the product of these quantities: • TFIDF(w) = TF(w) × IDF(w). Example (A1, 17:15h, MK), “Кривична пријава против Андреј Петров” (Criminal charges against Andrej Petrov): петров(28.699) пивара(25.589) андреј(23.382) комиси(23.15) мвр(20.603) кривич(16.449) дистри(16.036) сдсм(15.714) незако(14.921) жалела(14.482)…
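
The weighting formulas on this slide translate almost directly into code. The sketch below assumes raw counts for TF and the natural logarithm for IDF (the slide does not state the base), and takes already tokenized articles as input; `tfidf` is a name chosen for illustration.

```python
import math
from collections import Counter


def tfidf(docs):
    """Compute TFIDF(w) = TF(w) * (log(N / n_w) + 1) for every article.

    docs: list of token lists, one per news article.
    Returns one {word: weight} dict per article.
    """
    n = len(docs)
    df = Counter()  # n_w: number of articles containing each word
    for tokens in docs:
        df.update(set(tokens))
    weights = []
    for tokens in docs:
        tf = Counter(tokens)  # raw frequency within this article
        weights.append({w: tf[w] * (math.log(n / df[w]) + 1) for w in tf})
    return weights
```

For example, with two articles `[["петров", "петров", "на"], ["пивара", "на"]]`, the word “на” occurs in both, so its IDF is log(2/2) + 1 = 1 and its weight stays at its raw count, while “петров” gets 2 × (log 2 + 1).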
  10. Story Classification
  11. Story Classification: based on hub classification tags. [Figure: two news clusters, with each article labeled by the tag of the hub it came from (Macedonia, Culture, Fun, Region, NONE).]
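
The slide only states that classification is based on hub tags. One plausible reading, sketched here as an assumption, is a majority vote over the hub tags of the articles in a cluster, ignoring the uninformative NONE tag; `classify_story` is a hypothetical name.

```python
from collections import Counter


def classify_story(hub_tags, ignore=("None", "NONE")):
    """Return the most common hub tag among a cluster's articles.

    Tags listed in `ignore` do not vote; if nothing else is left,
    the story stays unclassified ("None").
    """
    votes = Counter(tag for tag in hub_tags if tag not in ignore)
    if not votes:
        return "None"
    return votes.most_common(1)[0][0]
```

For instance, a cluster whose articles came from hubs tagged Macedonia, Culture, Macedonia and Fun would be classified as Macedonia.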
  12. Scoring
  13. Cluster Scoring Logic: Cluster Score = quality-of-sources * freshness-of-news. Quality of a source: how useful is this source? - Non-dup fraction - Participation in large stories - First publisher of a top story
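
The product form of the cluster score can be illustrated as below. Everything beyond that product is an assumption of this sketch: the equal weighting of the three quality signals, summing quality over the sources in a cluster, and modelling freshness as exponential decay with a made-up half-life.

```python
from dataclasses import dataclass


@dataclass
class SourceStats:
    """Hypothetical per-source signals, each scaled to the range 0..1."""
    non_dup_fraction: float          # share of articles that are not duplicates
    large_story_fraction: float      # participation in large stories
    first_publisher_fraction: float  # how often first on a top story


def source_quality(stats, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Weighted mix of the three quality signals (weights are assumed)."""
    return (weights[0] * stats.non_dup_fraction
            + weights[1] * stats.large_story_fraction
            + weights[2] * stats.first_publisher_fraction)


def freshness(age_hours, half_life_hours=6.0):
    """Exponential decay: halves every `half_life_hours` (assumed shape)."""
    return 0.5 ** (age_hours / half_life_hours)


def cluster_score(sources, newest_age_hours):
    """Cluster Score = quality-of-sources * freshness-of-news."""
    quality = sum(source_quality(s) for s in sources)
    return quality * freshness(newest_age_hours)
```

Summing quality means a story covered by several good sources outranks one covered by a single source of the same quality; an average would remove that effect.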
  14. Article Scoring Logic: Article Score • Used for ranking within a cluster • Function of: Age, Quality of source, Title overlap with cluster centroid, Article size, …
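
The slide lists the inputs to the article score but not how they are combined. The following sketch shows one hypothetical combination: freshness (from age) multiplied by the average of source quality, title overlap with the cluster centroid, and a capped size term. The half-life, the `full_length` cap and the averaging are all assumptions.

```python
def title_overlap(title_tokens, centroid):
    """Fraction of title tokens that also occur in the cluster centroid."""
    if not title_tokens:
        return 0.0
    return sum(1 for t in title_tokens if t in centroid) / len(title_tokens)


def article_score(age_hours, source_quality, title_tokens, centroid,
                  body_length, half_life_hours=6.0, full_length=2000):
    """Score an article for ranking inside its cluster (assumed formula).

    centroid: {term: weight} vector of the cluster.
    body_length: article size in characters, capped at `full_length`.
    """
    freshness = 0.5 ** (age_hours / half_life_hours)
    size = min(body_length / full_length, 1.0)
    overlap = title_overlap(title_tokens, centroid)
    return freshness * (source_quality + overlap + size) / 3.0
```

Sorting a cluster's articles by this score in descending order would then pick the article shown as the cluster's lead.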
  15. Stats and Future Work
  16. News sources activity. [Figure: bar chart of activity per news source, on a scale of 0 to 200, over a period of 2 working days.]
  17. Visitors. [Figure: #visitors per month, Jul to Dec, on a scale of 0 to 7000. Annotations: launch on 1 July 2008; discussions on MK forums; article in Нова Македонија; “started to present news”.]
  18. Regional expansion: Slovenia. Next stop: Serbia
  19. Next to come … • Search of the archive • RSS feeds • Click metrics & personalization: cluster ranking adjustable to user preferences • News alerts: emails with links to news that contains provided keywords • Weekly and monthly news threads • New topics: Technology, Health, etc. • Inclusion of other news sources (currently only 26) • Automatic hub discovery • Improvements in the clustering algorithms (more sophisticated NLP), e.g. treating equivalent terms as one: СДСМ = СДС, премиер (prime minister) = груевски (Gruevski), нафта (oil) = бензин (petrol), АМС = агенција за млади и спорт (Agency of Youth and Sport), струја (electricity) = електрична енергија (electrical energy), etc. • Go beyond duplicate detection by measuring new fact introduction
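
The term equivalences listed on this slide suggest a normalization step before word weights are computed. A minimal sketch, assuming a hand-maintained token-level map (`SYNONYMS` and `normalize` are hypothetical, and multi-word phrases such as “агенција за млади и спорт” would need phrase matching that is not shown here):

```python
# Hypothetical map from variant token to canonical token, built from
# the single-word examples on the slide.
SYNONYMS = {
    "сдс": "сдсм",
    "груевски": "премиер",
    "бензин": "нафта",
}


def normalize(tokens, synonyms=SYNONYMS):
    """Lowercase tokens and replace known variants by their canonical form."""
    return [synonyms.get(t.lower(), t.lower()) for t in tokens]
```

With this in place, two articles that refer to the same entity by different names would share a term and therefore score higher on similarity.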
  20. Acknowledgments - Pajo & Biba for registering in MARNET - Karolina for offering DNS services and HTML/CSS tricks - Igor (Zuljo) for implementing the new design - Nikola and Daniel for implementing text extraction for - many, many users for suggesting improvements by sending tons of emails with bugs on pages
  21. Thank You! Q&A

