Time.Mk

    Time.Mk - Presentation Transcript

    1. A New Approach to News. Prof. Igor Trajkovski, Ph.D.
    2. Motivation: Traditionally, news readers first pick a publication and then look for headlines that interest them. TIME.mk Proprietary
    3. News Pipeline: Crawling and Extraction → Clustering → Story Classification → Scoring
    4. Crawling and Extraction
    5. Crawl
       • Most of the Macedonian news sites don't have RSS feeds.
       • One-level crawl from a set of hubs:
           (Macedonia) http://www.a1.com.mk/vesti/kategorija.aspx?KatID=1
           (Economy) http://www.a1.com.mk/vesti/kategorija.aspx?KatID=2
           (Sport) http://www.forum.com.mk/index.php?option=com_content&task=blogsection&id=9&Itemid=100
           (Culture) http://novamakedonija.com.mk/DesktopDefault.aspx?tabindex=1&tabid=2&CategID=26
           (None) http://www.kirilica.com.mk/
       • Many hubs per source. Some sources have fixed hub addresses (A1, Makfax, etc.); for others the hub addresses must be determined at runtime (Dnevnik, Utrinski, etc.).
       • Hubs are annotated with a section name (topic): Macedonia, Balkan, World, Economy, Culture, Fun, Sport, Chronicle, None
       • At the moment, the hubs of each source are provided manually.
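A one-level crawl of this kind can be sketched as follows. This is a minimal illustration, not the TIME.mk crawler: the class `LinkCollector`, the function `links_from_hub`, and the record layout are hypothetical names, and the hub HTML is passed in as a string so that fetching stays out of scope.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects (href, link text) pairs from anchor tags on a hub page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # Anchors without an href leave _href as None and are skipped.
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def links_from_hub(hub_url, topic, html):
    """Return one record per link on the hub page, tagged with the hub's topic.

    Only the hub page itself is parsed (one level); linked pages are not followed.
    """
    parser = LinkCollector()
    parser.feed(html)
    return [
        {"url": urljoin(hub_url, href), "link_text": text, "topic": topic}
        for href, text in parser.links
    ]
```

Each record carries the hub's topic forward, so an article found on the (Economy) hub arrives at later stages already tagged "Economy". The link text is kept because the extraction heuristics use it to locate the title.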
    6. Article Extraction
       Segment each article into title / body / image. Heuristics:
       • The title matches the link text and/or the HTML title, and is above the body
       • The body is a big run of unformatted Cyrillic text, below the title
       • The image is extracted from the hub page and has an attached link with exactly the same address as the article
       The same procedure is used to extract all articles from all sources!
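Two of these heuristics can be sketched in a few lines. The sketch assumes the page has already been split into plain-text blocks; `pick_body`, `title_matches`, and the "most Cyrillic characters" criterion are illustrative assumptions, and the positional rule (title above body) is not modeled here.

```python
import re

# Basic Cyrillic Unicode block; an assumption about what counts as Cyrillic text.
CYRILLIC = re.compile(r"[\u0400-\u04FF]")


def cyrillic_count(text):
    """Number of Cyrillic characters in a text block."""
    return len(CYRILLIC.findall(text))


def pick_body(blocks):
    """Pick the block with the most Cyrillic characters as the article body."""
    return max(blocks, key=cyrillic_count)


def _normalize(s):
    # Lowercase and collapse whitespace so minor markup differences do not matter.
    return " ".join(s.lower().split())


def title_matches(candidate, link_text, html_title):
    """Accept a candidate title if it equals the hub link text
    or occurs inside the page's HTML title."""
    c = _normalize(candidate)
    return bool(c) and (c == _normalize(link_text) or c in _normalize(html_title))
```

Because the rules depend only on link text, the HTML title, and the script of the text, the same code can run unchanged on every source, which is the point the slide emphasizes.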
    7. Clustering
    8. Clustering
       Partition news articles into disjoint subsets (clusters), such that:
       • News within a cluster are very similar
       • News in different clusters are very different
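The slides do not name the clustering algorithm. As one possible illustration only, the sketch below uses greedy single-pass clustering with cosine similarity over sparse term-weight vectors (dicts mapping word to weight). The functions `cosine` and `cluster` and the `threshold` value are assumptions, not the TIME.mk method.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as {term: weight}."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def cluster(vectors, threshold=0.5):
    """Assign each article to the most similar existing cluster, or open a new one.

    Returns a list of clusters, each a list of article indices. Every article
    lands in exactly one cluster, so the result is a disjoint partition.
    """
    clusters = []  # each entry: {"members": [...], "centroid": {...}}
    for i, vec in enumerate(vectors):
        best, best_sim = None, threshold
        for c in clusters:
            sim = cosine(vec, c["centroid"])
            if sim >= best_sim:
                best, best_sim = c, sim
        if best is None:
            clusters.append({"members": [i], "centroid": dict(vec)})
        else:
            best["members"].append(i)
            # The centroid is kept as a running sum; cosine ignores vector length.
            for t, w in vec.items():
                best["centroid"][t] = best["centroid"].get(t, 0.0) + w
    return [c["members"] for c in clusters]
```

A single pass is order-dependent, so a production system would likely need something more robust; the sketch only shows how "very similar within, very different across" turns into a similarity threshold.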
    9. Word weights
       The weight is a function of word frequency within a document and across all documents.
       • TF(w) = frequency of word w in a news article
         Intuition: a word appearing more frequently in a text is more likely to be related to its “meaning”
       • IDF(w) = log(N / n_w) + 1, where N = number of news articles and n_w = number of news articles containing w
         Intuition: words appearing in many news articles are generally not very informative (e.g., “и” (and), “е” (is), “на” (on/of), “не” (not), “со” (with), “во” (in), “нели” (isn't it), “затоа” (therefore), etc.)
       • TFIDF: the weight of a word in a news article is the product of these quantities: TFIDF(w) = TF(w) x IDF(w)
       Example (A1, 17:15h, MK): “Кривична пријава против Андреј Петров” (Criminal charges against Andrej Petrov)
       петров(28.699) пивара(25.589) андреј(23.382) комиси(23.15) мвр(20.603) кривич(16.449) дистри(16.036) сдсм(15.714) незако(14.921) жалела(14.482)…
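The weighting formulas translate directly into code. In this minimal sketch the function name `tfidf` and the token-list input format are assumptions, and the natural logarithm is used because the slide does not state the base.

```python
import math
from collections import Counter


def tfidf(docs):
    """docs: list of token lists, one per news article.

    Returns one {word: weight} dict per article, with
    weight = TF(w) * (log(N / n_w) + 1).
    """
    n_docs = len(docs)
    doc_freq = Counter()  # n_w: number of articles containing w
    for tokens in docs:
        doc_freq.update(set(tokens))
    weights = []
    for tokens in docs:
        tf = Counter(tokens)  # TF(w): raw count of w in this article
        weights.append(
            {w: tf[w] * (math.log(n_docs / doc_freq[w]) + 1) for w in tf}
        )
    return weights
```

A word that occurs in every article gets IDF = log(1) + 1 = 1, so its weight is just its raw count, while rarer words are boosted. That is why names such as “петров” lead the example list and function words do not appear in it.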
    10. Story Classification
    11. Story Classification
        Based on hub classification tags.
        [Figure: two clusters of news articles, each article labeled with the tag of the hub it came from (Macedonia, Culture, Fun, Region, NONE).]
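The slide says only that classification is based on hub tags. One plausible reading, shown here as an assumption, is a majority vote over the tags of the articles in a cluster, ignoring the None tag. The function name `classify_cluster` is hypothetical.

```python
from collections import Counter


def classify_cluster(hub_tags, fallback="None"):
    """Return the most common hub tag among a cluster's articles.

    Tags equal to "None" (in any letter case) carry no topic and are skipped;
    if nothing else remains, the fallback is returned.
    """
    votes = Counter(tag for tag in hub_tags if tag.upper() != "NONE")
    if not votes:
        return fallback
    return votes.most_common(1)[0][0]
```

With this rule a story reported mostly from "Macedonia" hubs is filed under Macedonia even if one source listed it under Culture.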
    12. Scoring
    13. Cluster Scoring Logic
        Cluster Score = quality-of-sources * freshness-of-news
        Quality of a source: how useful is this source?
        - Non-dup fraction
        - Participation in large stories
        - First publisher of a top story
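Only the product form of the cluster score comes from the slide. The sketch below fills in the two factors with assumptions: quality is the sum of per-source quality values over the distinct sources in the cluster, and freshness decays exponentially with the age of the newest article. The names `cluster_score` and `half_life_hours` and the 6-hour default are hypothetical.

```python
def cluster_score(articles, source_quality, half_life_hours=6.0):
    """articles: list of (source_name, age_in_hours) pairs for one cluster.
    source_quality: {source_name: quality value}; unknown sources count as 0.
    """
    # Assumed quality term: each distinct source contributes its quality once.
    sources = {src for src, _ in articles}
    quality = sum(source_quality.get(src, 0.0) for src in sources)
    # Assumed freshness term: halves every half_life_hours, driven by the newest article.
    newest_age = min(age for _, age in articles)
    freshness = 0.5 ** (newest_age / half_life_hours)
    return quality * freshness
```

The per-source quality values would themselves be derived from the signals listed on the slide (non-dup fraction, participation in large stories, first publisher of a top story); how those are combined is not stated, so they are taken as given inputs here.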
    14. Article Scoring Logic
        Article Score: used for ranking within a cluster. Function of:
        • Age
        • Quality of source
        • Title overlap with cluster centroid
        • Article size
        • …
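The slide lists the inputs but not how they are combined. The sketch below assumes a simple weighted sum; the weights, the 2000-character size cap, and the function names `title_overlap` and `article_score` are all hypothetical choices made for illustration.

```python
def title_overlap(title_tokens, centroid_terms):
    """Fraction of distinct title words that also appear among the centroid's terms."""
    title = set(title_tokens)
    if not title:
        return 0.0
    return len(title & set(centroid_terms)) / len(title)


def article_score(age_hours, source_quality, title_tokens, centroid_terms,
                  n_chars, weights=(0.4, 0.3, 0.2, 0.1)):
    """Weighted sum of the four listed features, each scaled to roughly [0, 1]."""
    freshness = 1.0 / (1.0 + age_hours)   # newer articles score higher
    overlap = title_overlap(title_tokens, centroid_terms)
    size = min(n_chars / 2000.0, 1.0)     # longer articles score higher, capped
    w_age, w_quality, w_overlap, w_size = weights
    return (w_age * freshness + w_quality * source_quality
            + w_overlap * overlap + w_size * size)
```

Sorting a cluster's articles by this score in descending order would put a fresh, on-topic article from a trusted source first.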
    15. Stats and Future Work
    16. News sources activity
        [Chart: activity per news source, scale 0 to 200, in a period of 2 working days.]
    17. Visitors
        [Chart: #visitors, Jul to Dec, scale 0 to 7000. Annotated events: launch of TIME.mk (1 July 2008); discussions on MK forums; article about TIME.mk in Нова Македонија; ON.net started to present TIME.mk news; 365.com.mk started to present TIME.mk news.]
    18. Regional expansion: Slovenia. Next stop: Serbia.
    19. Next to come …
        • Search of the archive
        • RSS feeds
        • Click metrics & personalization: cluster ranking adjustable to user preferences
        • News alerts: emails with links to news that contains the provided keywords
        • Weekly and monthly news threads
        • New topics: Technology, Health, etc.
        • Inclusion of other news sources (currently only 26)
        • Automatic hub discovery
        • Improvements in the clustering algorithms (more sophisticated NLP), e.g. recognizing equivalent terms: СДСМ = СДС (party name variants), премиер (prime minister) = груевски (Gruevski), нафта (oil) = бензин (gasoline), АМС = агенција за млади и спорт (Agency for Youth and Sport), струја (electricity) = електрична енергија (electric power), etc.
        • Go beyond duplicate detection by measuring new fact introduction
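The equivalent-term examples suggest a normalization step before word weighting. The sketch below is one simple way to do it and is an assumption, not a described TIME.mk feature: a hand-built table maps each variant to a canonical form, and whole-word matches are replaced in a single pass. The direction of each mapping and the names `SYNONYMS` and `canonicalize` are hypothetical.

```python
import re

# Variant -> canonical form, taken from the slide's examples (direction assumed).
SYNONYMS = {
    "сдс": "сдсм",
    "груевски": "премиер",
    "бензин": "нафта",
    "агенција за млади и спорт": "амс",
    "електрична енергија": "струја",
}

# Longest phrases first, so multi-word variants win over any shorter overlap.
_PATTERN = re.compile(
    r"\b("
    + "|".join(re.escape(p) for p in sorted(SYNONYMS, key=len, reverse=True))
    + r")\b"
)


def canonicalize(text):
    """Lowercase the text and replace each whole-word variant with its canonical form."""
    return _PATTERN.sub(lambda m: SYNONYMS[m.group(1)], text.lower())
```

The word-boundary anchors matter: without them the variant “сдс” would also match inside “сдсм” and corrupt the canonical form.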
    20. Acknowledgments
        - Pajo & Biba for registering TIME.mk in MARNET
        - Karolina for offering DNS services and HTML/CSS tricks
        - Igor (Zuljo) for implementing the new design
        - Nikola and Daniel for implementing text extraction for TIMES.si
        - Many, many users for suggesting improvements by sending tons of emails about bugs on TIME.mk pages
    21. Thank You! Q&A

    Darko Buldioski, 11 months ago

    © All Rights Reserved
