Identifying news clusters using Q-analysis and Modularity

With online publication and social media taking the main role in the dissemination of news, and with the decline of traditional printed media, it has become necessary to devise ways to automatically extract meaningful information from the plethora of sources available and to make that information readily available to interested parties. In this paper we present a method for the automated analysis of the underlying structure of online newspapers based on Q-analysis and modularity. We show how the combination of the two strategies allows the identification of well-defined news clusters that are free of noise (unrelated stories) and provides automated clustering of information on trending topics in news published online.

Published in: Education, Technology

Transcript

  • 1. Identifying news clusters using Q-analysis and Modularity. David Rodrigues, Centre for Complexity and Design, The Open University, UK – david.rodrigues@open.ac.uk
  • 2. Complexity & Design Workshop at ECCS13. Thursday am – Room S11. complexityanddesign.com
  • 3. Motivation
      • Find structure in collections of text documents.
      • Create computer algorithms to automate this discovery with minimal human supervision.
      • Use hybrid methodologies to improve the quality of results:
          – Topology-based approach describes the data
          – Clustering technique identifies modules
  • 4. Problem Description
      • Identify the structure of the news published online by The Guardian (among other newspapers) [Kohut, A. and Remez, M. (2008)]
          – Clustering?
          – Topology?
          – Topic modelling?
          – Noise?
          – Novelty?
          – Change?
  • 5. Clustering Techniques in Topic Modelling
      • Nearest neighbour classification
      • Bayesian probabilistic techniques
      • Decision trees
      • Regression models
      • Neural networks
      • Support Vector Machines
      • Language dependent / human intervention in the definition of categories for training samples.
  • 6. Clustering in Graphs is Community Detection
      • Modularity-based techniques [majority]
      • Spectral algorithms
      • Synchronization-based techniques
      • …
      • [See Community detection in graphs, Fortunato, 2010, for a comprehensive review]
      • Binary relations between nodes don't capture the multi-level structure of existing relations.
          – Move to n-ary relations and descriptions
  • 7. Previously
      • We used a sliding window over the time series of the news stories.
      • We used Variation of Information to measure changes in an evolving adaptive network of news [Meilă 2007, Rodrigues 2010].
  • 8. Our Proposal
      • Use a high-dimensional representation of the documents (simplicial complex).
      • Use Q-analysis to describe the system constructed from the Documents x Tags incidence matrix.
      • Use Q-connected components to filter noise.
      • Use modularity optimisation to find communities in the resulting induced graphs.
  • 9. Noise?
      • In the news context, we define noise news as news stories that are only loosely related to the main topics published.
      • We can filter them out by assuming that the Q-connectedness of these stories is very low.
  • 10. The Guardian
      • Classifies news with useful metadata:
          – …
          – Section
          – Tags
          – …
      http://www.theguardian.com/open-platform
      Open Platform with API for application development.
      3 years of data: 2010, 2011 and 2012
  • 11. Pseudo code for the automated news clustering and filtering algorithm
  • 12. Pseudo code for the automated news clustering and filtering algorithm
  • 13. Incidence Matrix (Documents x Tags)

               TAG 1  TAG 2  TAG 3  TAG 4  TAG 5  …
      NEWS 1     1      1      0      0      0    …
      NEWS 2     0      1      1      0      1    …
      NEWS 3     0      1      0      0      1    …
      NEWS 4     1      0      0      0      1    …
      NEWS 5     0      0      0      1      1    …
      …          …      …      …      …      …    …
  • 14. Results
  • 15. Community detection on the 0-connected graph. 1 month of news (November 2011); modularity = 0.48; 9 communities.
  • 16. Small fraction of vertices is highly connected
  • 17. Giant component only for low connected graph
  • 18. Modularity vs. connectedness
  • 19. Number of nodes decreases quickly with Q
  • 20. Number of nodes and Edge Density (November 2011)
  • 21. Average Clustering and Degree Assortativity
  • 22. Number of Components and Modularity
  • 23. Q=5 + Modularity
  • 24. Examples of Clusters (I)
  • 25. Examples of Clusters (II)
  • 26. Developed Tools
      • Theseus – a Python application for collecting, processing and visualising the textual dataset: https://github.com/sixhat/theseus
      • Visualisation tool
  • 27. Visualisation Tool
  • 28. Conclusions
      • Q-analysis gives a descriptive overview of the structure of the system, in terms of the local connectivity of the news stories.
      • Clustering (on top of the Q-analysis) gives a natural (highly modular) division of the resulting structures.
      • This allows the identification of coherent news clusters and the filtering of noise news.
  • 29. Generalisation of applicability
      • Instead of human-tagged documents, one can apply this to any kind of text-based documents:
          – HTML webpages: use the keywords tag from the header, or extract keywords with topic modelling (LDA, for example)
          – Scientific documents: tag documents with topic modelling strategies like LDA and, instead of treating low-connected stories as noise, explore the possibility that they might be emerging scientific trends.
  • 30. Take home message
      • Real complex systems are multi-dimensional. Community detection methods need to take those descriptions into account.
      • Constructing descriptions with all the relations (hyper-simplices) improves the quality of the results.
      • In the newspapers case, this helps the filtering of "noise" news (unrelated news).
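
The pipeline outlined in slides 8, 9 and 13 (incidence matrix, Q-analysis, noise filtering, modularity) can be sketched roughly as below. This is a minimal, standard-library-only illustration and not the Theseus implementation. The function names `q_graph`, `components` and `modularity` are hypothetical. It assumes that two stories are q-near when they share at least q + 1 tags, and that a story with no q-near neighbour counts as noise. Modularity is only scored for a given partition here; a real run would use a modularity optimiser to find the partition.

```python
from itertools import combinations

# Documents x Tags incidence taken from the slide 13 example:
# each news story maps to the set of tag numbers it carries.
incidence = {
    "NEWS 1": {1, 2},
    "NEWS 2": {2, 3, 5},
    "NEWS 3": {2, 5},
    "NEWS 4": {1, 5},
    "NEWS 5": {4, 5},
}

def q_graph(incidence, q):
    """Adjacency sets linking stories that share at least q + 1 tags (assumed q-nearness)."""
    # Only stories with more than q tags can take part at level q.
    adj = {doc: set() for doc, tags in incidence.items() if len(tags) > q}
    for a, b in combinations(adj, 2):
        if len(incidence[a] & incidence[b]) > q:
            adj[a].add(b)
            adj[b].add(a)
    return adj

def components(adj):
    """Connected components of the q-graph, i.e. the q-connected components."""
    seen, comps = set(), []
    for start in adj:
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:
            node = stack.pop()
            if node in comp:
                continue
            comp.add(node)
            stack.extend(adj[node] - comp)
        seen |= comp
        comps.append(comp)
    return comps

def modularity(adj, partition):
    """Newman modularity of a partition of an undirected, unweighted graph."""
    m = sum(len(nbrs) for nbrs in adj.values()) / 2  # number of edges
    if m == 0:
        return 0.0
    score = 0.0
    for community in partition:
        internal = sum(len(adj[n] & community) for n in community) / 2
        degree = sum(len(adj[n]) for n in community)
        score += internal / m - (degree / (2 * m)) ** 2
    return score

# Noise filter (assumption): keep only q-connected components with more than one story.
g1 = q_graph(incidence, 1)
clusters = [c for c in components(g1) if len(c) > 1]
print(sorted(clusters[0]))  # ['NEWS 2', 'NEWS 3']
```

In this toy example all five stories are 0-connected, but at q = 1 only NEWS 2 and NEWS 3 (which share tags 2 and 5) stay linked; the other stories would be dropped as noise under the assumption stated above.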
