PhD Report
Svitlana Vakulenko
TU Wien
February 15, 2016
Overview
Status of thesis
Relation to other work
Next steps and ideas
Status of thesis
So far so. . .
2014
Topic Modeling
Event Extraction
2015
Target-dependent Sentiment Analysis
Information Diffusion
2016
Breaking News Detection
. . .
Topic Modeling
[Vakulenko et al., 2014, Herbst et al., 2014, Reuter et al., 2014]
@ University of Liechtenstein
Method: Latent Dirichlet Allocation (LDA) [Blei et al., 2003]
Datasets: iTunes, case studies, sustainability reports
Topic Modeling: Results [Vakulenko et al., 2014] (1)
Figure: Correspondence chart showing the overlap of LDA topics and
iTunes categories
(1) https://ai.wu.ac.at/~vakulenko/
Event Extraction
[Katsios et al., 2015]
Summer School @ NCSR Demokritos
Project: REVEAL EU-FP7 2013-2016
Method: Relation Extraction (ClausIE)
Datasets: FACup, SNOW, World Cup (tweets)
Event Extraction: Results [Katsios et al., 2015]
Figure: Relations extracted from FACup dataset
Target-dependent Sentiment Analysis
@ MODUL University Vienna
Method: POS tagging, dependency parsing, ML classifier (logistic
regression)
Datasets: MPQA (news articles), JDPA (product reviews)
Target-dependent Sentiment Analysis: Results
Information Diffusion
@ MODUL University Vienna
Project: PHEME EU-FP7 2014-2017
Method: Relation Extraction
Dataset: news articles, tweets
Information Diffusion: Results
Figure: s: president barack obama – p: state D – o:
Breaking News Detection
@ MODUL University Vienna
Project: InVID EU-Horizon 2016-2019
WP: Social Media Mining
Task: Emergent Topic Detection
Dataset: tweets
Status of thesis
Topics Events
Breaking News Sentiment Analysis
Information Diffusion
Relation to other work
State of the Art
Requirements
Newsworthiness
Scalability
Methodology
Data acquisition
Topic modeling
Event extraction
First story detection
State of the Art
The SNOW 2014 Data Challenge confirmed that newsworthy topic detection
is a challenging task [Papadopoulos et al., 2014] (2):
F-score: 0.4 Precision: 0.56 Recall: 0.36 [Ifrim et al., 2014]
The limitations of the current state-of-the-art approaches include
early topic detection
topic relevance
topic representation
performance evaluation of the topic detection methods.
The most recent results are reported in the related
work [Martin et al., 2015].
(2) [Van Canneyt et al., 2014, Martin and Göker, 2014, Burnside et al., 2014,
Petkos et al., 2014]
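The precision, recall and F-score figures quoted on this slide can be illustrated with a minimal sketch. It assumes that detected and reference topics are matched by exact label equality, which is a simplification and not necessarily the matching procedure used in the SNOW 2014 evaluation; the function name `topic_scores` and the example labels are hypothetical.

```python
def topic_scores(detected, reference):
    """Return (precision, recall, f_score) for two sets of topic labels.

    Assumption: a detected topic counts as correct only if the same
    label appears in the reference set.
    """
    detected, reference = set(detected), set(reference)
    true_positives = len(detected & reference)
    precision = true_positives / len(detected) if detected else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F-score as the harmonic mean of precision and recall.
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


if __name__ == "__main__":
    # Hypothetical topic labels, for illustration only.
    print(topic_scores({"a", "b", "c", "d"}, {"a", "b", "e"}))
```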
Requirements: Newsworthiness
a set of topics for a given time slot ‘covered in mainstream
news sites’ [Papadopoulos et al., 2014]
‘the combination of novelty and
significance’ [Martin et al., 2015]
One common method to find novel (emerging or recently trending)
topics in a data stream is to look for bursts in the frequency
of keywords and phrases
(n-grams) [Martin et al., 2015, Martin and Göker, 2014,
Fujiki et al., 2004, Cataldi et al., 2010, Aiello et al., 2013].
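A minimal sketch of burst detection over n-grams is given below. It is not the scoring used in the cited work: the burst score here is simply the document frequency in the current time slot divided by the mean document frequency over previous slots (with add-one smoothing), and the names `bursty_ngrams`, `min_score` and `min_df` are assumptions made for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def document_frequencies(docs, n_values=(1, 2)):
    """Count in how many documents each n-gram occurs."""
    df = Counter()
    for doc in docs:
        tokens = doc.lower().split()
        seen = set()
        for n in n_values:
            seen.update(ngrams(tokens, n))
        df.update(seen)  # each n-gram counted once per document
    return df


def bursty_ngrams(current_docs, previous_slots, min_score=2.0, min_df=2):
    """Rank n-grams whose frequency in the current slot exceeds the past average."""
    current = document_frequencies(current_docs)
    history = [document_frequencies(slot) for slot in previous_slots]
    scores = {}
    for term, df in current.items():
        if df < min_df:
            continue  # ignore n-grams that are rare in the current slot
        mean_prev = sum(h[term] for h in history) / len(history) if history else 0.0
        score = (df + 1) / (mean_prev + 1)  # add-one smoothing
        if score >= min_score:
            scores[term] = score
    # Highest score first; ties broken alphabetically.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
```

With time slots of tweets as input, an n-gram such as "earthquake" that is absent from earlier slots but appears in several current tweets receives a high score, while n-grams that are equally frequent in every slot stay near 1.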
Requirements: Scalability
an important requirement when dealing with data streams
of high volume and velocity, e.g. Twitter
BNgram approach [Martin and Göker, 2014]: 2 minutes per
topic model for a 15-minute dataset of tweets
Methodology: Data acquisition
Twitter is the major source of news stream
data [Hu et al., 2012].
Only a few studies focus on data sources other than the Twitter
stream, e.g.
Wikipedia [Osborne et al., 2012, Steiner et al., 2013].
New: integration of other social media APIs and cross-media
retrieval, e.g.:
tweets → topics (events) → (YouTube) → videos
Methodology: Topic modeling
Topic detection approaches often involve
topic clustering
topic ranking
topic labeling
[Petkos et al., 2014, Martin and Göker, 2014,
Van Canneyt et al., 2014, Martin et al., 2015, Ifrim et al., 2014,
Elbagoury et al., 2015].
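The three steps listed above (clustering, ranking, labeling) can be sketched as a toy pipeline. This is an illustration under simple assumptions and not the method of any cited paper: clustering is a greedy single pass using Jaccard similarity over token sets, ranking is by cluster size, and labels are the most frequent tokens. All function names and the `threshold` and `k` parameters are hypothetical.

```python
from collections import Counter


def tokenize(text):
    """Lower-case whitespace tokenization into a set of tokens."""
    return set(text.lower().split())


def jaccard(a, b):
    """Jaccard similarity between two token sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster(docs, threshold=0.3):
    """Greedy single-pass clustering: join the most similar cluster or start a new one."""
    clusters = []  # each cluster is a list of token sets
    for doc in docs:
        tokens = tokenize(doc)
        best, best_sim = None, 0.0
        for candidate in clusters:
            sim = max(jaccard(tokens, member) for member in candidate)
            if sim > best_sim:
                best, best_sim = candidate, sim
        if best is not None and best_sim >= threshold:
            best.append(tokens)
        else:
            clusters.append([tokens])
    return clusters


def rank(clusters):
    """Rank clusters by size, largest first."""
    return sorted(clusters, key=len, reverse=True)


def label(members, k=3):
    """Label a cluster with its k most frequent tokens (ties broken alphabetically)."""
    counts = Counter(tok for member in members for tok in member)
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [tok for tok, _ in ordered[:k]]


def detect_topics(docs, threshold=0.3, k=3):
    """Run clustering, ranking and labeling; return (label, size) pairs."""
    return [(label(c, k), len(c)) for c in rank(cluster(docs, threshold))]
```

Size-based ranking is only one possible choice; a ranking that reflects newsworthiness would need additional signals such as burstiness or novelty.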
Methodology: Event extraction
News stories are often centered around specific events (happenings),
which provide a natural way to group the news
stories [Wu et al., 2015].
There exist several online services that mine events from news
articles in different languages:
European Media Monitor (3) [Pouliquen et al., 2008];
GDELT project (4) [Leetaru and Schrodt, 2013];
Event Registry (5) [Leban et al., 2014, Rupnik et al., 2015]
A few approaches to extract open-domain events from tweets have been
proposed [Popescu et al., 2011, Ritter et al., 2012,
Katsios et al., 2015], but none of them supports cross-lingual
linking.
(3) http://emm.newsbrief.eu
(4) http://www.gdeltproject.org/
(5) http://eventregistry.org
Methodology: First story detection
The task of first story detection (FSD) was proposed to identify
the first story about a certain event from a document
stream [Petrovic et al., 2012]. The state-of-the-art FSD
approaches use similarity metrics over documents, such as TF-IDF
vectors or Locality Sensitive Hashing (LSH)
[Petrovic et al., 2012, Phuvipadawat and Murata, 2010], to
determine if candidate documents are close to existing documents
or could constitute a new event.
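The novelty decision described above can be sketched as a nearest-neighbour check over a document stream. This is a deliberately simplified illustration: it uses cosine similarity over raw term-count vectors, whereas the cited approaches use TF-IDF weighting and LSH to avoid comparing against every earlier document. The function name `first_stories` and the `threshold` value are assumptions.

```python
import math
from collections import Counter


def vectorize(text):
    """Bag-of-words term counts (a simplification of TF-IDF weighting)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def first_stories(stream, threshold=0.5):
    """Return indices of documents that are not close to any earlier document."""
    seen = []
    firsts = []
    for index, text in enumerate(stream):
        vec = vectorize(text)
        # Exhaustive comparison; LSH would restrict this to likely neighbours.
        nearest = max((cosine(vec, old) for old in seen), default=0.0)
        if nearest < threshold:
            firsts.append(index)  # candidate first story of a new event
        seen.append(vec)
    return firsts
```

The exhaustive comparison grows linearly with the number of documents seen, which is why hashing-based approximations matter for high-volume streams.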
Next steps and ideas
Project: InVID EU-Horizon 2016-2019
WP: Social Media Mining
Deadline: June 2016 (deliverable)
Agenda:
Data acquisition
Breaking news detection
Evaluation framework: Twitter Trends, [Ifrim et al., 2014],
[Martin et al., 2015]
Methodology: topic modeling, event extraction, (semantic and
cross-lingual) ontology-based integration (e.g. BabelNet)
Progress: social media APIs integration proposal
Bibliography I
Aiello, L., Petkos, G., Martin, C., Corney, D., Papadopoulos,
S., Skraba, R., Göker, A., Kompatsiaris, I., and Jaimes, A.
(2013).
Sensing Trending Topics in Twitter.
IEEE Transactions on Multimedia, 15(6):1268–1282.
Blei, D. M., Ng, A. Y., and Jordan, M. I. (2003).
Latent Dirichlet Allocation.
The Journal of Machine Learning Research, 3:993–1022.
Burnside, G., Milioris, D., and Jacquet, P. (2014).
One Day in Twitter: Topic Detection Via Joint Complexity.
Cataldi, M., Di Caro, L., and Schifanella, C. (2010).
Emerging Topic Detection on Twitter Based on Temporal and
Social Terms Evaluation.
MDMKDD, pages 4:1–4:10.
Bibliography II
Elbagoury, A., Ibrahim, R., Farahat, A., Kamel, M., and
Karray, F. (2015).
Exemplar-Based Topic Detection in Twitter Streams.
In International AAAI Conference on Web and Social Media.
Fujiki, T., Nanno, T., Suzuki, Y., and Okumura, M. (2004).
Identification of bursts in a document stream.
In International Workshop on Knowledge Discovery in Data
Streams, pages 55–64.
Herbst, A., Simons, A., Brocke, J. v., Müller, O., Debortoli, S.,
and Vakulenko, S. (2014).
Identifying and Characterizing Topics in Enterprise Content
Management: a Latent Semantic Analysis of Vendor Case
Studies.
In 22nd European Conference on Information Systems, ECIS.
Bibliography III
Hu, M., Liu, S., Wei, F., Wu, Y., Stasko, J., and Ma, K.-L.
(2012).
Breaking news on twitter.
In Conference on Human Factors in Computing Systems,
pages 2751–2754.
Ifrim, G., Shi, B., and Brigadir, I. (2014).
Event detection in twitter using aggressive filtering and
hierarchical tweet clustering.
In SNOW-DC@ WWW, pages 33–40.
Katsios, G., Vakulenko, S., Krithara, A., and Paliouras, G.
(2015).
Towards open domain event extraction from twitter: Revealing
entity relations.
In DeRiVE@ ESWC, pages 35–46.
Bibliography IV
Leban, G., Fortuna, B., Brank, J., and Grobelnik, M. (2014).
Cross-lingual detection of world events from news articles.
In Proceedings of the ISWC, pages 21–24.
Leetaru, K. and Schrodt, P. A. (2013).
Gdelt: Global data on events, location, and tone, 1979–2012.
In ISA Annual Convention, volume 2, page 4.
Martin, C., Corney, D., and Göker, A. (2015).
Mining Newsworthy Topics from Social Media.
In Advances in Social Media Analysis, pages 21–43.
Martin, C. and Göker, A. (2014).
Real-time topic detection with bursty n-grams: RGU’s
submission to the 2014 SNOW challenge.
In SNOW-DC@ WWW.
Bibliography V
Osborne, M., Petrovic, S., McCreadie, R., Macdonald, C., and
Ounis, I. (2012).
Bieber no more: First story detection using Twitter and
Wikipedia.
In TAIA.
Papadopoulos, S., Corney, D., and Aiello, L. M. (2014).
Snow 2014 data challenge: Assessing the performance of news
topic detection methods in social media.
In SNOW-DC@ WWW, pages 1–8.
Petkos, G., Papadopoulos, S., and Kompatsiaris, Y. (2014).
Two-level message clustering for topic detection in twitter.
In SNOW-DC@ WWW, pages 49–56.
Bibliography VI
Petrovic, S., Osborne, M., and Lavrenko, V. (2012).
Using paraphrases for improving first story detection in news
and Twitter.
In Conference of the North American Chapter of the
Association for Computational Linguistics, pages 338–346.
Phuvipadawat, S. and Murata, T. (2010).
Breaking News Detection and Tracking in Twitter.
In International Conference on Web Intelligence and Intelligent
Agent Technology (WI-IAT), volume 3, pages 120–123.
Popescu, A.-M., Pennacchiotti, M., and Paranjpe, D. (2011).
Extracting events and event descriptions from twitter.
In WWW, pages 105–106.
Bibliography VII
Pouliquen, B., Steinberger, R., and Deguernel, O. (2008).
Story tracking: linking similar news over time and across
languages.
In Proceedings of the workshop on Multi-source Multilingual
Information Extraction and Summarization, pages 49–56.
Reuter, N., Vakulenko, S., Brocke, J. v., Debortoli, S., and
Müller, O. (2014).
Identifying the Role of Information Systems in Achieving
Energy-Related Environmental Sustainability using Text
Mining.
In 22nd European Conference on Information Systems, ECIS.
Ritter, A., Etzioni, O., Clark, S., and others (2012).
Open domain event extraction from twitter.
In SIGKDD, pages 1104–1112.
Bibliography VIII
Rupnik, J., Muhic, A., Leban, G., Skraba, P., Fortuna, B., and
Grobelnik, M. (2015).
News Across Languages-Cross-Lingual Document Similarity
and Event Tracking.
arXiv preprint arXiv:1512.07046.
Steiner, T., van Hooland, S., and Summers, E. (2013).
MJ No More: Using Concurrent Wikipedia Edit Spikes with
Social Network Plausibility Checks for Breaking News
Detection.
In WWW, pages 791–794.
Vakulenko, S., Müller, O., and Brocke, J. v. (2014).
Enriching iTunes App Store Categories via Topic Modeling.
In Proceedings of the International Conference on Information
Systems ICIS.
Bibliography IX
Van Canneyt, S., Feys, M., Schockaert, S., Demeester, T.,
Develder, C., and Dhoedt, B. (2014).
Detecting newsworthy topics in Twitter.
In SNOW-DC@ WWW, pages 1–8.
Wu, Z., Chen, L., and Giles, C. L. (2015).
Storybase: Towards Building a Knowledge Base for News
Events.
In ACL, pages 133–138.

Vakulenko PhD Status Report - 16 February 2016
