Time.Mk

Published in: News & Politics, Business

  1. A New Approach to News. Prof. Igor Trajkovski, Ph.D.
  2. Motivation: Traditionally, news readers first pick a publication and then look for headlines that interest them. TIME.mk Proprietary
  3. News Pipeline: Crawling and Extraction → Clustering → Story Classification → Scoring
  4. Crawling and Extraction
  5. Crawl
     • Most Macedonian news sites don't have RSS feeds.
     • One-level crawl from a set of hubs:
         (Macedonia) http://www.a1.com.mk/vesti/kategorija.aspx?KatID=1
         (Economy) http://www.a1.com.mk/vesti/kategorija.aspx?KatID=2
         (Sport) http://www.forum.com.mk/index.php?option=com_content&task=blogsection&id=9&Itemid=100
         (Culture) http://novamakedonija.com.mk/DesktopDefault.aspx?tabindex=1&tabid=2&CategID=26
         (None) http://www.kirilica.com.mk/
     • Many hubs per source. Some sources have fixed hub addresses (A1, Makfax, etc.); for others the hub addresses must be determined at runtime (Dnevnik, Utrinski, etc.).
     • Hubs are annotated with a section name (topic): Macedonia, Balkan, World, Economy, Culture, Fun, Sport, Chronicle, None.
     • At the moment, the hubs of each source are provided manually.
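The one-level crawl described on the slide above can be sketched roughly as follows. This is a minimal illustration, not the TIME.mk crawler: the names `LinkCollector` and `one_level_crawl` are hypothetical, the hub HTML is passed in as a string rather than fetched, and the same-site filter is an assumption that the slides do not state.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects (href, link text) pairs from anchor tags on a hub page."""

    def __init__(self):
        super().__init__()
        self.links = []     # finished (href, text) pairs
        self._href = None   # href of the anchor currently open, if any
        self._text = []     # text fragments seen inside that anchor

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def one_level_crawl(hub_url, hub_html, section):
    """Return candidate articles linked from one hub, tagged with the hub's section."""
    parser = LinkCollector()
    parser.feed(hub_html)
    hub_host = urlparse(hub_url).netloc
    seen, articles = set(), []
    for href, text in parser.links:
        url = urljoin(hub_url, href)
        # Assumption: keep only same-site links that have visible text;
        # skip repeated links and links back to the hub itself.
        if urlparse(url).netloc != hub_host or not text or url in seen or url == hub_url:
            continue
        seen.add(url)
        articles.append({"url": url, "link_text": text, "section": section})
    return articles
```

Each returned record keeps the link text and the hub's section tag, since later slides use both: the link text for title matching and the tag for story classification.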
  6. Article Extraction
     Segment each article into title / body / image.
     Heuristics:
     • Title matches the link text and/or the HTML title, and is above the body.
     • Body is a long run of unformatted Cyrillic text, below the title.
     • Image is extracted from the hub page and has an attached link with exactly the same address as the article.
     The same procedure is used for extracting all articles from all sources.
  7. Clustering
  8. Clustering: Partition the news articles into disjoint subsets (clusters), such that articles within a cluster are very similar and articles in different clusters are very different.
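The slides do not name the clustering algorithm, so the following is only one simple way to produce a disjoint partition of this kind: a greedy single-pass grouping by cosine similarity over sparse word-weight vectors. The function names and the threshold value are assumptions made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse {word: weight} vectors."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def greedy_clusters(vectors, threshold=0.5):
    """Put each article into the first cluster whose seed article is similar enough.

    Assumption: the threshold of 0.5 is illustrative, not a value from the slides.
    """
    clusters = []  # each cluster is a list of article indices; index 0 is the seed
    for i, vec in enumerate(vectors):
        for cluster in clusters:
            if cosine(vec, vectors[cluster[0]]) >= threshold:
                cluster.append(i)
                break
        else:
            # No existing cluster is close enough, so this article seeds a new one.
            clusters.append([i])
    return clusters
```

Every article lands in exactly one cluster, so the result is a partition; the quality of the grouping depends on the order of the input and on the threshold.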
  9. Word weights
     The weight is a function of word frequency within a document and across all documents.
     TF(w) = frequency of word w in a news article
     • Intuition: a word appearing more frequently in a text is more likely to be related to its “meaning”.
     IDF(w) = log(N / n_w) + 1
     • where N = number of news articles and n_w = number of news articles containing w
     • Intuition: words appearing in many news articles are generally not very informative (e.g., “и” (and), “е” (is), “на” (on/of), “не” (not), “со” (with), “во” (in), “нели” (isn't it), “затоа” (therefore), etc.)
     TFIDF: the weight of a word in a news article is the product of these quantities:
     • TFIDF(w) = TF(w) x IDF(w)
     Example (A1, 17:15h, MK): “Кривична пријава против Андреј Петров” (Criminal charges against Andrej Petrov)
     Top-weighted terms: петров(28.699) пивара(25.589) андреј(23.382) комиси(23.15) мвр(20.603) кривич(16.449) дистри(16.036) сдсм(15.714) незако(14.921) жалела(14.482) …
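The weighting on the slide above can be written out as a short sketch. It assumes that articles are already tokenized into word lists and that the logarithm is natural; the slide specifies neither, and the function name `tfidf_weights` is hypothetical.

```python
import math
from collections import Counter


def tfidf_weights(documents):
    """documents: list of token lists. Returns one {word: weight} dict per document."""
    n_docs = len(documents)
    # n_w: number of documents that contain word w at least once.
    doc_freq = Counter(word for doc in documents for word in set(doc))
    weights = []
    for doc in documents:
        tf = Counter(doc)  # TF(w): raw count of w in this document
        weights.append({
            # TFIDF(w) = TF(w) x (log(N / n_w) + 1); natural log is an assumption.
            w: count * (math.log(n_docs / doc_freq[w]) + 1.0)
            for w, count in tf.items()
        })
    return weights
```

Because of the "+ 1" term, a word that appears in every article still gets a weight equal to its raw frequency instead of zero.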
  10. Story Classification
  11. Story Classification: based on hub classification tags. [Figure: two example clusters whose news articles carry hub tags such as Macedonia, Culture, Fun, Region, and NONE.]
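The slide says only that a story's topic is derived from the hub tags of its articles. One plausible reading, assumed here and not confirmed by the slides, is a majority vote that ignores untagged articles. The function name `classify_cluster` is hypothetical.

```python
from collections import Counter


def classify_cluster(article_tags, untagged="None"):
    """Pick the most common hub tag among a cluster's articles.

    Assumption: articles from hubs tagged "None" do not vote.
    """
    votes = Counter(tag for tag in article_tags if tag != untagged)
    if not votes:
        return untagged
    # most_common orders by count; ties keep first-seen order.
    return votes.most_common(1)[0][0]
```

For example, a cluster whose articles came from hubs tagged Macedonia, Culture, Macedonia, and None would be labeled Macedonia under this rule.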
  12. Scoring
  13. Cluster Scoring Logic
      Cluster Score = quality-of-sources * freshness-of-news
      Quality of a source (how useful is this source?):
      - Non-duplicate fraction
      - Participation in large stories
      - First publisher of a top story
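The slide gives the product form of the cluster score but not how its two factors are computed. The sketch below fills those gaps with loudly labeled assumptions: source quality is an equal-weight average of the three listed signals (each expressed as a fraction in [0, 1]), quality-of-sources is the sum over the sources in the cluster, and freshness decays exponentially with an illustrative half-life. All function names and parameters are hypothetical.

```python
def source_quality(non_dup_fraction, large_story_fraction, first_publisher_fraction):
    # Assumption: equal-weight average of the three signals named on the slide.
    return (non_dup_fraction + large_story_fraction + first_publisher_fraction) / 3.0


def freshness(age_hours, half_life_hours=6.0):
    # Assumption: exponential decay; the 6-hour half-life is illustrative only.
    return 0.5 ** (age_hours / half_life_hours)


def cluster_score(source_qualities, newest_article_age_hours):
    # Cluster Score = quality-of-sources * freshness-of-news (from the slide).
    # Assumption: quality-of-sources is the sum over the sources in the cluster.
    return sum(source_qualities) * freshness(newest_article_age_hours)
```

Under these assumptions, a story covered by more high-quality sources scores higher, and any story's score halves every `half_life_hours` once no newer article joins the cluster.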
  14. Article Scoring Logic
      Article Score: used for ranking within a cluster. Function of:
      • Age
      • Quality of source
      • Title overlap with cluster centroid
      • Article size
      • …
  15. Stats and Future Work
  16. News sources activity [Figure: bar chart with one bar per news source, scale 0 to 200; * in a period of 2 working days.]
  17. Visitors [Figure: #visitors per month, Jul to Dec, scale 0 to 7000. Annotated events include: launch of TIME.mk (1 July 2008); discussions on MK forums; article about TIME.mk in Нова Македонија (Nova Makedonija); ON.net started to present TIME.mk news; 365.com.mk started to present TIME.mk news.]
  18. Regional expansion: Slovenia. Next stop: Serbia.
  19. Next to come …
      • Search of the archive
      • RSS feeds
      • Click metrics & personalization: cluster ranking adjustable to the user's preferences
      • News alerts: emails with links to news that contains the provided keywords
      • Weekly and monthly news threads
      • New topics: Technology, Health, etc.
      • Inclusion of other news sources (currently only 26)
      • Automatic hub discovery
      • Improvements in the clustering algorithms (more sophisticated NLP), e.g. treating equivalent terms as the same: СДСМ = СДС (SDSM = SDS), премиер = груевски (prime minister = Gruevski), нафта = бензин (oil = gasoline), АМС = агенција за млади и спорт (AMS = Agency for Youth and Sport), струја = електрична енергија (electricity = electric power), etc.
      • Go beyond duplicate detection by measuring new fact introduction
  20. Acknowledgments
      - Pajo & Biba for registering TIME.mk in MARNET
      - Karolina for offering DNS services and HTML/CSS tricks
      - Igor (Zuljo) for implementing the new design
      - Nikola and Daniel for implementing text extraction for TIMES.si
      - the many, many users who suggested improvements by sending tons of emails about bugs on TIME.mk pages
  21. Thank You! Q&A
