M. Platakis¹, D. Kotsakos¹, D. Gunopulos¹

¹Department of Informatics and Telecommunications, University of Athens

{platakis, d.kotsakos, dg}@di.uoa.gr
   Introduction

   Related Work

   Our Approach

   Experimental Results

   Future Work
   What is a blog?

   Why research on blogs?

    ◦ 200K new blogs every day

    ◦ 120M English blogs

                      Blogosphere
   The Problem: Searching in blogs



   Blogs are an instant medium. When something
    important happens bloggers write about it



   Need for an automated searching/analyzing
    mechanism
   Each blog post has a timestamp
    ◦ Not present in traditional web content
    ◦ Not taken into account by                etc.

   Popularity of a term may increase in time

   This is a burst!

   In our work we detect bursty terms and
    correlations between them…
    …or at least we try to!
This is a burst!
   Related Work
   J. Kleinberg’s attempt to find meaningful
    structure in e-mail streams


   State transitions in an infinite-state probabilistic
    automaton mark the bursts


   Costs assigned to state transitions eliminate
    short bursts and fragmentation of longer
    ones
   HMM Theory

   Given an event stream, a minimum-cost state
    sequence is found (dynamic programming)
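
The minimum-cost state sequence can be illustrated with a deliberately reduced sketch. It assumes only two states (base and burst) instead of the infinite-state automaton, and a per-day count of titles containing the term. The function name `detect_bursts` and the parameters `s` (burst rate multiplier) and `gamma` (cost of entering the burst state) are illustrative assumptions, not the authors' implementation.

```python
import math

def detect_bursts(relevant, totals, s=2.0, gamma=1.0):
    """Two-state simplification: return the min-cost state per day (0 = base, 1 = burst)."""
    n = len(relevant)
    p0 = sum(relevant) / sum(totals)      # base rate of the term over the whole period
    if not 0 < p0 < 1:
        return [0] * n                    # degenerate stream: nothing to detect
    rates = (p0, min(0.9999, s * p0))     # the burst state emits the term s times as often
    up_cost = gamma * math.log(n)         # moving up costs something, moving down is free

    def emit(state, r, d):
        # Negative log binomial likelihood; the binomial coefficient is dropped
        # because it is the same for both states and does not change the minimum.
        p = rates[state]
        return -(r * math.log(p) + (d - r) * math.log(1 - p))

    cost = [0.0, math.inf]                # the automaton starts in the base state
    back = []
    for r, d in zip(relevant, totals):
        new_cost, pointers = [], []
        for j in (0, 1):
            candidates = [cost[i] + (up_cost if j > i else 0.0) for i in (0, 1)]
            best_i = 0 if candidates[0] <= candidates[1] else 1
            new_cost.append(candidates[best_i] + emit(j, r, d))
            pointers.append(best_i)
        cost = new_cost
        back.append(pointers)

    # Follow the back-pointers from the cheapest final state.
    state = 0 if cost[0] <= cost[1] else 1
    states = [state]
    for pointers in reversed(back[1:]):
        state = pointers[state]
        states.append(state)
    states.reverse()
    return states
```

A larger `gamma` makes it more expensive to enter the burst state, which is how short, weak bursts get suppressed in this sketch.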
   [Bansal et al. (2007)] present Blogscope.net – a
    system that identifies keyword bursts



   A standard normal curve, formed by the popularity
    values of a term for the last few days, is used to
    find outliers



   Analysis performed on the body of blog posts
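
The outlier idea on this slide can be sketched as a z-score test against the last few days. This is a minimal illustration under assumptions: the function name `is_outlier`, the window passed as `history`, and the threshold of 2 standard deviations are hypothetical and are not taken from Blogscope.

```python
from statistics import mean, stdev

def is_outlier(history, today, threshold=2.0):
    """Flag today's popularity if it lies far above the recent mean."""
    if len(history) < 2:
        return False                      # not enough data to estimate spread
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return today > mu                 # flat history: any increase stands out
    return (today - mu) / sigma > threshold
```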
   [Vlachos et al. (2004)]: finding similarities between
    time-series
   [Kumar et al. (2005)], [Chi et al. (2007)]: further
    analysis of the Blogosphere
   [Wang et al. (2007)]: mining of correlated bursty
    topic patterns
   [Zhang et al. (2007)]: detecting bursts
   [Agarwal (2008)]: identifying influential bloggers
   [Cheng et al. (2008)]: discovering bloggers’
    interests
   [Ding et al. (2008)]: opinion mining
   Our Approach
1.   Application on blog data



2.   Bursty terms discovery



3.   Accuracy evaluation: matching the results
     with real-life events (if possible)
4.   Correlated keywords usually form a topic;
     our goal is to identify such topics



5.   Burstiness = when + how much
     Similar burstiness → potential term correlation



6.   Accuracy evaluation: matching obtained groups
     of keywords with real-life topics (if possible)
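
Step 5 can be sketched by grouping keywords whose burst intervals overlap. This is only an illustration of the "when" part of burstiness: it ignores the weights, and the names `overlaps` and `group_by_overlap` and the `(keyword, start_day, end_day)` tuple layout are assumptions, not the authors' method.

```python
def overlaps(a, b):
    """True if two (keyword, start_day, end_day) bursts share at least one day."""
    return a[1] <= b[2] and b[1] <= a[2]

def group_by_overlap(bursts):
    """Greedy grouping: a burst joins the first group it overlaps, else starts a new one."""
    groups = []
    for burst in sorted(bursts, key=lambda b: b[1]):
        for group in groups:
            if any(overlaps(burst, other) for other in group):
                group.append(burst)
                break
        else:
            groups.append([burst])
    return groups
```

Overlap in time alone is a weak signal, so groups produced this way are candidates for a topic rather than confirmed topics.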
   Why use Kleinberg’s algorithm?
    ◦ Sophisticated notion of temporal bursts
    ◦ The Blogscope team claims it would be too slow

   We experiment on blog post titles only
    ◦ To determine whether titles suffice to extract
      meaningful bursts
    ◦ To save computation time

   Titles had to be extracted via the
    permalinks of the blog posts
   Experimental Results
   Examples of bursty keywords, compared
    with Blogscope.net

   Dataset from the first ten days of April 2008
    ◦ 30,000 blog post titles

   Data pre-processing
    ◦ stop-word removal, lowercasing, no stemming

   Automaton’s output (comes in real time!):
       “Keyword, weight, start_day, end_day”
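
The pre-processing listed above can be sketched as follows. The stop-word list here is a tiny illustrative stand-in, and `tokenize` and `daily_counts` are hypothetical helper names; the slides do not say which stop-word list or tokenizer was used.

```python
import re
from collections import Counter

# Illustrative stop-word list only; a real run would use a much fuller one.
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "on", "is", "to", "at"}

def tokenize(title):
    """Lowercase a title, split it into alphanumeric words, drop stop words (no stemming)."""
    words = re.findall(r"[a-z0-9]+", title.lower())
    return [w for w in words if w not in STOP_WORDS]

def daily_counts(posts):
    """Map each day to a Counter of how many titles on that day contain each term.

    posts is an iterable of (day, title) pairs.
    """
    counts = {}
    for day, title in posts:
        counts.setdefault(day, Counter()).update(set(tokenize(title)))
    return counts
```

Per-day term counts of this kind are the sort of input a burst detector needs, together with the total number of titles per day.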
   “fool” comes out as a bursty term
   Output: “fool,17.343,1,2”
   Thus, “fool” was bursty on the first two days of April
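
An output line in the “Keyword, weight, start_day, end_day” format can be read into a small record like this. The `Burst` type and `parse_burst` function are hypothetical names used only for illustration.

```python
from typing import NamedTuple

class Burst(NamedTuple):
    keyword: str
    weight: float
    start_day: int
    end_day: int

def parse_burst(line):
    """Parse one 'keyword,weight,start_day,end_day' line into a Burst."""
    # Split from the right so a keyword containing a comma would stay intact.
    keyword, weight, start, end = line.strip().rsplit(",", 3)
    return Burst(keyword, float(weight), int(start), int(end))
```

For example, `parse_burst("fool,17.343,1,2")` yields a record whose `start_day` is 1 and `end_day` is 2.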
   “preprocessor” comes out as a bursty term
   Output: “preprocessor,85.947,6,6”
   Thus, “preprocessor” was bursty on April 6



   Numerous spam blog posts were uploaded on
    April 6

   All contained the term “preprocessor”
   Searching through the results to identify
    clusters of correlated keywords
   An example:

       Keyword    Weight   Start day   End day
       heston     12.213   5           5
       charlton    9.929   5           5
       actor       4.302   4           5
       dead        3.388   5           5
       84          3.08    5           5




   Bloggers wrote about: the death of actor Charlton Heston at age 84
   Future Work
   Tags + Titles + Multimedia metadata
   Development of an automated method to
    discover correlated keywords
   Extraction of a popularity curve per keyword
   Use of an appropriate distance metric
    (time-series analysis) to compare such curves
   Application to micro-blogging and continuous
    news streams
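
One possible distance between popularity curves is sketched below: Euclidean distance after z-normalization, so that two keywords with the same shape but different volumes come out close. The slides leave the choice of metric open, so this is an assumption for illustration, and `z_normalize` and `curve_distance` are hypothetical names.

```python
import math

def z_normalize(curve):
    """Shift a curve to zero mean and scale it to unit standard deviation."""
    mu = sum(curve) / len(curve)
    sigma = math.sqrt(sum((x - mu) ** 2 for x in curve) / len(curve))
    if sigma == 0:
        return [0.0] * len(curve)         # a flat curve has no shape to compare
    return [(x - mu) / sigma for x in curve]

def curve_distance(a, b):
    """Euclidean distance between two equal-length, z-normalized popularity curves."""
    if len(a) != len(b):
        raise ValueError("curves must cover the same days")
    za, zb = z_normalize(a), z_normalize(b)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(za, zb)))
```

A small distance between two keywords' curves would mark them as candidates for the same topic.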
Discovering Hot Topics in the Blogosphere
