A New Approach to News




Prof. Igor Trajkovski, Ph.D.




Motivation
   – Traditionally, news readers first pick a publication
     and then look for headlines that interest them.




                                                            TIME.mk Proprietary
News Pipeline

  1. Crawling and extraction
  2. Clustering
  3. Story classification
  4. Scoring
Crawling and Extraction




Crawl
• Most Macedonian news sites don’t have RSS feeds.

• One-level crawl from a set of hubs:
     (Macedonia) http://www.a1.com.mk/vesti/kategorija.aspx?KatID=1
     (Economy) http://www.a1.com.mk/vesti/kategorija.aspx?KatID=2
     (Sport) http://www.forum.com.mk/index.php?option=com_content&task=blogsection&id=9&Itemid=100
     (Culture) http://novamakedonija.com.mk/DesktopDefault.aspx?tabindex=1&tabid=2&CategID=26
     (None) http://www.kirilica.com.mk/

• Many hubs per source.
     Some sources have fixed hub addresses (A1, Makfax, etc.); for others
     the hub addresses must be determined at run time (Dnevnik, Utrinski, etc.).

• Hubs are annotated with a section name (topic):
     Macedonia, Balkan, World, Economy, Culture, Fun, Sport, Chronicle, None

• At the moment, the hubs for each source are provided manually.
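
A one-level crawl of this kind could be sketched as follows. This is only an illustration using Python's standard library; the class and function names (`LinkCollector`, `crawl_hub`) are hypothetical and the hub HTML is passed in as a string, so how pages are actually fetched is left open.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects (absolute URL, link text) pairs from one hub page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []      # finished (url, text) pairs
        self._href = None    # href of the <a> currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def crawl_hub(hub_url, topic, hub_html):
    """One-level crawl: every link found on the hub inherits the hub's topic."""
    parser = LinkCollector(hub_url)
    parser.feed(hub_html)
    return [{"url": url, "link_text": text, "topic": topic}
            for url, text in parser.links]
```

Keeping the link text next to each URL is a deliberate choice in this sketch, because the title heuristic in the extraction step compares the article title against it.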

Article Extraction

Segment each article into title / body / image.

Heuristics:
• The title matches the link text and/or the HTML title, and is above the body.
• The body is a big run of unformatted Cyrillic text, below the title.
• The image is extracted from the hub page and has an attached link with
  exactly the same address as the article.

The same procedure is used for extracting all articles from all sources.
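
Two of these heuristics can be approximated in a few lines. The sketch below is an assumption about how they might be coded, not the TIME.mk implementation: `pick_body` and `pick_title` are hypothetical helpers, the input is assumed to be already split into text blocks, and the positional rules (title above body) are not modelled.

```python
import re

# Basic Cyrillic Unicode block, which covers Macedonian letters.
CYRILLIC = re.compile(r"[\u0400-\u04FF]")


def cyrillic_chars(text):
    """Number of Cyrillic letters in a text block."""
    return len(CYRILLIC.findall(text))


def pick_body(text_blocks):
    """Return the block with the most Cyrillic letters (the 'big run' of text)."""
    return max(text_blocks, key=cyrillic_chars, default="")


def pick_title(candidates, link_text):
    """Return the first candidate that matches the hub link text, else None."""
    wanted = link_text.strip().casefold()
    for candidate in candidates:
        if candidate.strip().casefold() == wanted:
            return candidate.strip()
    return None
```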
Clustering




Clustering

Partition the news articles into disjoint clusters such that:
     News within a cluster are very similar
     News in different clusters are very different
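
The slides do not name the clustering algorithm. As one possible illustration, the sketch below uses a greedy single-pass scheme with cosine similarity, assuming each article is already represented as a dict of word weights; the function names and the threshold of 0.5 are assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse word-weight dicts."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster(vectors, threshold=0.5):
    """Greedy single-pass clustering; returns lists of article indices."""
    clusters = []   # each cluster: {"members": [...], "sum": {...}}
    for i, vec in enumerate(vectors):
        best, best_sim = None, threshold
        for c in clusters:
            # The summed vector points in the same direction as the centroid.
            sim = cosine(vec, c["sum"])
            if sim >= best_sim:
                best, best_sim = c, sim
        if best is None:
            best = {"members": [], "sum": {}}
            clusters.append(best)
        best["members"].append(i)
        for term, w in vec.items():
            best["sum"][term] = best["sum"].get(term, 0.0) + w
    return [c["members"] for c in clusters]
```

Each article is assigned to exactly one cluster, so the result is a disjoint partition as required above.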



Word weights
The weight is a function of word frequency within a document and across all documents.

TF(w) = frequency of word w in a news article
   •   Intuition: a word appearing more frequently in a text is more likely to be related to
       its “meaning”

IDF(w) = log(N / n_w) + 1
   •   where N = number of news articles and n_w = number of news articles containing w
   •   Intuition: words appearing in many news articles are generally not very informative
       (e.g., “и” (and), “е” (is), “на” (on), “не” (not), “со” (with), “во” (in),
       “нели” (isn't it), “затоа” (therefore), etc.)

TFIDF: the weight of a word in a news article is the product of these quantities:
   •   TFIDF(w) = TF(w) × IDF(w)

Example: A1, 17:15h, MK
          Кривична пријава против Андреј Петров (Criminal charges against Andrej Petrov)

          Top weighted word stems:
          петров(28.699) пивара(25.589) андреј(23.382) комиси(23.15) мвр(20.603) кривич(16.449)
          дистри(16.036) сдсм(15.714) незако(14.921) жалела(14.482)…
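
The weighting above translates almost directly into code. The following is a minimal sketch that assumes documents are already tokenized (and stemmed, if desired) and that the logarithm is the natural log, which the slide does not specify; `tfidf` is a hypothetical name.

```python
import math
from collections import Counter


def tfidf(docs):
    """docs: list of token lists. Returns one {word: weight} dict per document."""
    n_docs = len(docs)
    doc_freq = Counter()            # n_w: number of documents containing w
    for tokens in docs:
        doc_freq.update(set(tokens))
    weights = []
    for tokens in docs:
        tf = Counter(tokens)        # TF(w): raw count of w in this document
        weights.append({
            w: count * (math.log(n_docs / doc_freq[w]) + 1)
            for w, count in tf.items()
        })
    return weights
```

With the `+ 1` term, a word that occurs in every article still keeps a weight equal to its raw count, so common words are down-weighted rather than removed.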


Story Classification




Story Classification

Based on hub classification tags

Hub tags of the news in cluster 1: Macedonia, Macedonia, Region, Macedonia, NONE
     => cluster topic: Macedonia

Hub tags of the news in cluster 2: Culture, Fun, Macedonia, Culture, Macedonia
     => cluster topic: Culture
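
One simple way to derive a cluster topic from the hub tags is a majority vote. The sketch below is an assumption rather than the documented method: untagged articles are ignored, and ties go to the tag seen first, since the slide does not state a tie-breaking rule.

```python
from collections import Counter


def classify_cluster(hub_tags, untagged=("NONE", "None")):
    """Return the most common hub tag among a cluster's articles."""
    counts = Counter(tag for tag in hub_tags if tag not in untagged)
    if not counts:
        return "None"
    # Equal counts are ordered by first appearance (an assumed tie-break).
    return counts.most_common(1)[0][0]
```

For the tag list Macedonia, Macedonia, Region, Macedonia, NONE this returns Macedonia.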




Scoring




Cluster Scoring Logic

 Cluster Score = quality-of-sources * freshness-of-news

 Quality of a source: how useful is this source?
     - Non-duplicate fraction
     - Participation in large stories
     - First publisher of a top story
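
The product above could be computed along the following lines. Everything beyond the product itself is an assumption made for illustration: the exponential decay with a 6-hour half-life, taking freshness from the newest article, and summing per-source quality values that are assumed to be precomputed in the range 0 to 1.

```python
from datetime import datetime, timedelta


def freshness(published, now, half_life_hours=6.0):
    """Exponential decay: 1.0 for brand-new news, 0.5 after one half-life."""
    age_hours = (now - published).total_seconds() / 3600.0
    return 0.5 ** (max(age_hours, 0.0) / half_life_hours)


def cluster_score(articles, source_quality, now):
    """articles: list of (source, published) pairs.
    source_quality: dict mapping a source name to a quality value in [0, 1]."""
    quality = sum(source_quality.get(src, 0.0) for src, _ in articles)
    newest = max(published for _, published in articles)
    return quality * freshness(newest, now)


# Usage with made-up quality values:
now = datetime(2008, 12, 1, 12, 0)
articles = [("A1", now - timedelta(hours=6)), ("Makfax", now - timedelta(hours=12))]
score = cluster_score(articles, {"A1": 0.8, "Makfax": 0.4}, now)
```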




Article Scoring Logic

Article Score
    – Used for ranking within a cluster
    – Function of:
         • Age
         • Quality of source
         • Title overlap with cluster centroid
         • Article size
         • …
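
The slide lists the inputs but not how they are combined. A minimal sketch, assuming a weighted sum with hypothetical weights and scalings, and a title overlap measured as the share of centroid weight carried by the title words:

```python
def title_overlap(title_words, centroid):
    """Share of the centroid's total weight covered by words in the title."""
    total = sum(centroid.values())
    if not total:
        return 0.0
    return sum(centroid.get(w, 0.0) for w in set(title_words)) / total


def article_score(age_hours, source_quality, overlap, size_chars,
                  weights=(0.4, 0.3, 0.2, 0.1)):
    """Weighted sum of four signals, each scaled to [0, 1]."""
    recency = 1.0 / (1.0 + age_hours)           # newer articles score higher
    size = min(size_chars / 2000.0, 1.0)        # saturates at 2000 characters
    w_age, w_quality, w_overlap, w_size = weights
    return (w_age * recency + w_quality * source_quality
            + w_overlap * overlap + w_size * size)
```

Articles in a cluster would then be sorted by this score in descending order, so the top-ranked article represents the story.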




Stats and Future Work




News sources activity*

[Bar chart: activity per news source, horizontal axis from 0 to 200]

* in period of 2 working days
Visitors

#visitors by month:

     Month   #visitors
     Jul       100
     Aug       180
     Sept      700
     Oct      1500
     Nov      4500
     Dec      6500

Events annotated on the chart, from left to right:
• Launch of TIME.mk (1 July 2008)
• Article about TIME.mk in Nova Makedonija
• Discussions on MK forums
• ON.net started to present TIME.mk news
• 365.com.mk started to present TIME.mk news
Regional expansion - Slovenia




                  Next stop: Serbia
Next to come …
• Search of the archive
• RSS feeds
• Click metrics & personalization
       cluster ranking adjustable to user preferences

• News alerts
       emails with links to news that contains the provided keywords

• Weekly and monthly news threads
• New topics: Technology, Health, etc.
• Inclusion of more news sources (currently only 26)
• Automatic hub discovery
• Improvements in the clustering algorithms (more sophisticated NLP)
       Treat equivalent terms as the same: СДСМ = СДС (SDSM = SDS),
       премиер = груевски (prime minister = Gruevski), нафта = бензин (oil = gasoline),
       АМС = агенција за млади и спорт (AMS = Agency for Youth and Sport),
       струја = електрична енергија (electricity = electrical energy), etc.
       Go beyond duplicate detection by measuring the introduction of new facts
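
The term equivalences listed above could be applied as a normalization step before word weights are computed. The sketch below is hypothetical: the tables contain only the pairs from the slide, and which side of each pair is treated as canonical is an arbitrary choice.

```python
# Hypothetical synonym tables built from the pairs listed on the slide.
# Multi-word phrases are replaced before single tokens are looked up.
PHRASES = {
    "агенција за млади и спорт": "амс",
    "електрична енергија": "струја",
}
WORDS = {
    "сдс": "сдсм",
    "груевски": "премиер",
    "бензин": "нафта",
}


def normalize(text):
    """Lower-case the text and map known synonyms to one canonical form."""
    text = text.lower()
    for phrase, canonical in PHRASES.items():
        text = text.replace(phrase, canonical)
    return [WORDS.get(token, token) for token in text.split()]
```

After normalization, two articles that use different names for the same entity share the same token, which raises their similarity during clustering.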

Acknowledgments

- Pajo & Biba for registering TIME.mk in MARNET
- Karolina for offering DNS services and HTML/CSS tricks
- Igor (Zuljo) for implementing the new design
- Nikola and Daniel for implementing text extraction for TIMES.si
- The many users who suggested improvements by sending lots of emails reporting
  bugs on TIME.mk pages




Thank You!
Q&A




