Time.Mk - Igor Trajkovski @ Glocal: Inside Social Media

Published in: Business, Technology

  1. Computer generated news site TIME.mk. Dr Igor Trajkovski, New York University Skopje, Skopje, Macedonia, 16.10.2009. (28 slides)
  2. Outline: 1 Introduction, 2 Crawler, 3 Clustering, 4 Classification, 5 Scoring, 6 Results, 7 Conclusion
  3. Motivation
     Traditionally, news readers first pick a medium and then look for headlines that interest them.
     New: pick a story, then read what the different media wrote about it.
  4. Google Approach: news.google.com
  5. Google Approach: www.time.mk
  7. Crawling links
     Many Macedonian news sites don't have RSS feeds.
     Crawling hubs (topic, address):
       - Macedonia: http://www.a1.com.mk/vesti/kategorija.aspx?KatID=1
       - Economy: http://www.a1.com.mk/vesti/kategorija.aspx?KatID=2
       - Culture: http://novamakedonija.com.mk/DesktopDefault.aspx?tabid=2&CategID=26
       - None: www.bbc.co.uk/macedonian/
     Many hubs per source:
       - fixed hub addresses (A1, Makfax, etc.)
       - dynamic hub addresses (Dnevnik, Utrinski, Sitel, etc.)
     Hubs are already classified into predefined topics:
       - Macedonia, Balkan, World, Economy, Skopje, Culture, Technology, Fun, LifeStyle, ShowBiz, Sport, Chronicle, None
     At the moment, the hubs of each source are provided manually.
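A manually maintained hub list like the one on this slide could be kept as plain data. The sketch below is only an illustration: the `Hub` record, the `hubs_by_source` helper, and the source label "Nova Makedonija" (guessed from the URL) are assumptions, not part of the original system.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Hub:
    """One manually registered hub page (hypothetical record layout)."""
    source: str  # news source that owns the hub
    topic: str   # predefined topic, or "None" when the hub is unclassified
    url: str     # address of the hub page


# Entries taken from the slide; the source labels are assumptions.
HUBS = [
    Hub("A1", "Macedonia", "http://www.a1.com.mk/vesti/kategorija.aspx?KatID=1"),
    Hub("A1", "Economy", "http://www.a1.com.mk/vesti/kategorija.aspx?KatID=2"),
    Hub("Nova Makedonija", "Culture",
        "http://novamakedonija.com.mk/DesktopDefault.aspx?tabid=2&CategID=26"),
    Hub("BBC", "None", "http://www.bbc.co.uk/macedonian/"),
]


def hubs_by_source(hubs):
    """Group hubs by source, since one source usually has many hubs."""
    grouped = defaultdict(list)
    for hub in hubs:
        grouped[hub.source].append(hub)
    return dict(grouped)
```

Keeping the topic next to each URL means every article found on a hub inherits a first topic label, which the later classification step can vote on.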
  8. WWW “jungle”
     Some of the issues:
       - News articles change their title (A1, BBC, Netpress, etc.)
       - News articles change their address (Sitel, Dnevnik, etc.)
       - Corrupted HTML pages (a customized UTF-8 decoder had to be developed)
       - News articles from several topics appear on the same page
  9. Crawler statistics
       - 100 parallel crawlers (threads) running on a quad-core machine
       - 50 news sources (+3 per month), 350 hubs
       - 4500 live news articles (archives not included)
       - 1400 new news articles (10-15% duplicates)
       - Crawling all hubs takes 1-2 minutes
       - Refresh rate: 10 minutes
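The "100 parallel crawlers (threads)" setup can be sketched with a standard thread pool. This is a minimal sketch, not the author's crawler: `crawl_hubs` and the injected `fetch` callable are hypothetical names, and error handling is reduced to recording `None` for hubs that fail.

```python
from concurrent.futures import ThreadPoolExecutor


def crawl_hubs(hub_urls, fetch, max_workers=100):
    """Fetch every hub page with a pool of worker threads.

    `fetch` is any callable url -> page text (an assumption; the slides do
    not describe the download code). Returns {url: page text or None}.
    """
    urls = list(hub_urls)

    def safe_fetch(url):
        # One broken hub must not abort the whole crawl round.
        try:
            return fetch(url)
        except Exception:
            return None

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        pages = list(pool.map(safe_fetch, urls))  # map preserves input order
    return dict(zip(urls, pages))
```

Threads fit here because the work is dominated by waiting on the network. A scheduler would call `crawl_hubs` once per refresh interval (10 minutes on the slide).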
  10. Information extraction
  12. Clustering problem
  13. Clustering problem
  14. Keywords extraction
      The weight vector for document d is:
        $v_d = [w_{1,d}, w_{2,d}, \ldots, w_{K,d}]^T$
      where:
        $w_{t,d} = tf_t \cdot \log \frac{|D|}{|D_t|}$
      and:
        - K is the number of distinct words occurring in all news articles;
        - $tf_t$ is the term frequency of term t in document d (a local parameter);
        - $\log \frac{|D|}{|D_t|}$ is the inverse document frequency (a global parameter), where |D| is the total number of news articles and $|D_t|$ is the number of news articles containing the term t.
      The model is known as the Term Frequency-Inverse Document Frequency (TF-IDF) model.
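The TF-IDF weights defined on this slide can be computed in a few lines. The sketch assumes that term frequency is the raw count of the term in the document and that the logarithm is natural; the slide fixes neither choice, and `tfidf_vectors` is a hypothetical helper name.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Compute w_{t,d} = tf_t * log(|D| / |D_t|) for every document.

    `docs` is a list of token lists; the result holds one sparse
    {term: weight} dict per document.
    """
    n_docs = len(docs)                 # |D|
    doc_freq = Counter()               # |D_t| for every term t
    for tokens in docs:
        doc_freq.update(set(tokens))   # count each term once per document

    vectors = []
    for tokens in docs:
        term_freq = Counter(tokens)    # raw counts (assumed definition of tf)
        vectors.append({
            term: count * math.log(n_docs / doc_freq[term])
            for term, count in term_freq.items()
        })
    return vectors
```

A term that appears in every article gets weight 0, because log(|D| / |D_t|) = log 1 = 0, so common words drop out of the keyword list without a stop-word table.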
  15. Keywords extraction (cont.)
      A news article is represented by the top $N_{kw}$ keywords with the largest weights.
      The weight vector $v_d$ is normalized: $\|v_d\| = 1$.
      The similarity of two news articles is computed from the angle between their vectors:
        $\cos\theta = \frac{v_1 \cdot v_2}{\|v_1\| \, \|v_2\|}$
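Truncating to the top keywords, normalizing, and comparing two articles can be sketched as follows. The function names and the sparse-dict representation are assumptions made for illustration.

```python
import math


def top_keywords(weights, n_kw):
    """Keep the n_kw highest-weighted terms and scale them to unit length."""
    top = sorted(weights.items(), key=lambda item: item[1], reverse=True)[:n_kw]
    norm = math.sqrt(sum(w * w for _, w in top))
    if norm == 0:
        return {}  # no informative keywords
    return {term: w / norm for term, w in top}


def cosine_similarity(v1, v2):
    """Cosine of the angle between two unit-length sparse vectors.

    Because both vectors are already normalized, the dot product alone
    equals cos(theta).
    """
    if len(v1) > len(v2):
        v1, v2 = v2, v1  # iterate over the shorter vector
    return sum(w * v2.get(term, 0.0) for term, w in v1.items())
```

For example, weights {x: 3, y: 4} normalize to {x: 0.6, y: 0.8}, and comparing that article with one whose only keyword is x gives a similarity of about 0.6.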
  16. Hierarchical Agglomerative Clustering (HAC)
  18. News story classification
      Problem: most news sources don't classify their news articles in as much detail as TIME.mk does.
        Example: Balkan articles are published in the section World; Technology, ShowBiz and LifeStyle in Fun; Skopje in Macedonia; etc.
      They also use topic names that differ from TIME.mk's topic names.
        Example: Region for Balkan, Trends for Culture, Future for Technology, etc.
  19. News story classification using ontology
      Let $N_1$ be the number of articles from the most frequent class $C_1$.
      Let $N_2$ be the number of articles from the second most frequent class $C_2$.
        $Class = \begin{cases} C_1, & \text{if } C_1 \text{ is more specific than, or has no relation with, } C_2 \\ C_2, & \text{if } C_2 \text{ is more specific than } C_1 \text{ and } N_1 \le 2 \cdot N_2 \\ C_1, & \text{if } C_2 \text{ is more specific than } C_1 \text{ and } N_1 > 2 \cdot N_2 \end{cases}$
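The voting rule above translates almost directly into code. In this sketch the `PARENT` table is a hypothetical ontology fragment inferred from the examples on the previous slide (Balkan under World, Skopje under Macedonia, and so on); the real ontology is not given in the slides, and the function names are assumptions.

```python
from collections import Counter

# Hypothetical ontology fragment: specific topic -> more general topic.
PARENT = {
    "Balkan": "World",
    "Skopje": "Macedonia",
    "Technology": "Fun",
    "ShowBiz": "Fun",
    "LifeStyle": "Fun",
}


def is_more_specific(topic, other, parent=PARENT):
    """True if `topic` lies below `other` in the ontology."""
    seen = set()
    while topic in parent and topic not in seen:
        seen.add(topic)
        topic = parent[topic]
        if topic == other:
            return True
    return False


def classify_cluster(article_topics, parent=PARENT):
    """Pick the cluster class from the topics of its articles."""
    counts = Counter(article_topics).most_common(2)
    if not counts:
        return None
    if len(counts) == 1:
        return counts[0][0]
    (c1, n1), (c2, n2) = counts
    # The more specific runner-up wins unless C1 has more than twice its votes.
    if is_more_specific(c2, c1, parent) and n1 <= 2 * n2:
        return c2
    return c1
```

With four articles from World hubs and two from Balkan hubs the cluster is labeled Balkan (4 ≤ 2 · 2), while five World articles against two Balkan articles keep the label World.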
  21. General properties for any news ranking algorithm
      Any ranking algorithm for news stories (clusters) should have at least the following four properties:
        - Time awareness. The importance of a piece of news changes over time.
        - An important news story is covered by many news articles. The weighted size of the cluster is a measure of its importance.
        - Authority of the sources. BBC is more authoritative than Kirilica.
        - Diversity. A news story reported by three different news sources is more important than a news story of three articles published by one source.
  22. TIME.mk ranking algorithm
      Weight of the cluster c at moment t:
        $WC(c, t) = SourceEntropy(c) \cdot \sum_{i=1}^{k} WN(n_i, t)$
      where:
        - k is the size of the cluster c;
        - $n_i$ are the news articles of cluster c;
        - SourceEntropy(c) represents the entropy of the news sources;
        - $WN(n_i, t)$ is the weight at time t of news article $n_i$, which was published at time $t_i$:
            $WN(n_i, t) = A(source(n_i)) \cdot e^{-\alpha (t - t_i)}, \quad t > t_i$
        - A(s) accounts for the authority of the source s. A source is more authoritative if it is cited more than other sources.
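The cluster weight can be sketched as below. The slide does not define SourceEntropy, so the code assumes Shannon entropy (natural log) over the share of articles each source contributes; the authority values, the time unit, and all function names are likewise assumptions.

```python
import math
from collections import Counter


def source_entropy(sources):
    """Shannon entropy of the source distribution (assumed definition)."""
    total = len(sources)
    counts = Counter(sources)
    return sum((n / total) * math.log(total / n) for n in counts.values())


def article_weight(authority, published_at, now, alpha):
    """WN(n_i, t) = A(source(n_i)) * exp(-alpha * (t - t_i))."""
    return authority * math.exp(-alpha * (now - published_at))


def cluster_weight(articles, authority, now, alpha):
    """WC(c, t) for a cluster given as (source, published_at) pairs.

    `authority` maps source name -> A(s); the values are supplied by the
    caller because the slides do not say how citations are counted.
    """
    entropy = source_entropy([source for source, _ in articles])
    total = sum(
        article_weight(authority[source], published_at, now, alpha)
        for source, published_at in articles
    )
    return entropy * total
```

Under this assumed entropy, a cluster whose articles all come from one source has entropy 0 and therefore weight 0. That matches the diversity property in spirit, but how the real system treats single-source stories is not stated in the slides.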
  24. Dataset
        Category        #news      Category        #news
        Macedonia       58,596     Sport           43,498
        Balkan          19,341     Chronicle       10,108
        World           29,647     Culture         16,933
        Economy         29,754     Technology       5,755
        Skopje           7,325     Fun/Showbiz     37,798
        Uncategorized   58,205
        Total          316,960
      Collected over a period of one year (from 01/07/08 to 30/06/09).
  25. Classification of News Stories
      Ontology majority voting algorithm for cluster classification.
        Category    Precision  Recall     Category     Precision  Recall
        Macedonia   95%        98%        Sport        98%        96%
        Balkan      94%        96%        Chronicle    98%        98%
        World       98%        98%        Culture      98%        96%
        Economy     97%        96%        Technology   98%        98%
        Skopje      98%        96%        Fun/Showbiz  98%        97%
      100 manually labeled clusters per category were used as the evaluation set.
  26. Visits statistics
  28. Conclusion and future work
      - This work was motivated by the heavy use of news engines, in contrast with the lack of academic papers in this area.
      - The design and implementation details of a full-scale news engine were presented.
      - A model for scoring news articles and news stories was presented.
      - Extensive testing on more than 300,000 news articles, posted by 50 sources over one year, showed very encouraging results.
      Future work:
      - Scoring of news articles and news stories.
      - A new “semantic” model for representing news is needed to replace the current “bag of words” model.
