@ WU Reading Group
* Status of thesis
* Relation to other work
* Next steps and ideas
Topic Modeling, Event Extraction, Target-dependent Sentiment Analysis...
3. Status of thesis
So far so...
2014
Topic Modeling
Event Extraction
2015
Target-dependent Sentiment Analysis
Information Diffusion
2016
Breaking News Detection
...
4. Topic Modeling
[Vakulenko et al., 2014, Herbst et al., 2014, Reuter et al., 2014]
@ University of Liechtenstein
Method: Latent Dirichlet Allocation (LDA) [Blei et al., 2003]
Datasets: iTunes, case studies, sustainability reports
5. Topic Modeling: Results [Vakulenko et al., 2014] (https://ai.wu.ac.at/~vakulenko/)
Figure: Correspondence chart showing the overlap of LDA topics and
iTunes categories
6. Event Extraction
[Katsios et al., 2015]
Summer School @ NCSR Demokritos
Project: REVEAL EU-FP7 2013-2016
Method: Relation Extraction (ClausIE)
Datasets: FACup, SNOW, World Cup (tweets)
14. Relation to other work
State of the Art
Requirements
Newsworthiness
Scalability
Methodology
Data acquisition
Topic modeling
Event extraction
First story detection
15. State of the Art
The SNOW 2014 Data Challenge confirmed newsworthy topic detection
to be a challenging task [Papadopoulos et al., 2014] (see also
[Van Canneyt et al., 2014, Martin and Göker, 2014, Burnside et al., 2014,
Petkos et al., 2014]):
F-score: 0.4 Precision: 0.56 Recall: 0.36 [Ifrim et al., 2014]
The limitations of the current state-of-the-art approaches include
early topic detection
topic relevance
topic representation
performance evaluation of the topic detection methods.
The most recent results are reported in the related
work [Martin et al., 2015].
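As a quick sanity check on the figures quoted for [Ifrim et al., 2014], the sketch below assumes the F-score is the usual harmonic mean of precision and recall (F1). The challenge may aggregate scores differently (e.g. per time slot), so this is only an illustration; the function name `f_score` is made up for this example.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Precision and recall as quoted on the slide.
f1 = f_score(0.56, 0.36)
print(round(f1, 2))  # 0.44
```

Under this assumption the result is about 0.44, which is consistent with the reported F-score of 0.4 when rounded to one decimal.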
16. Requirements: Newsworthiness
a set of topics for a given time slot ‘covered in mainstream
news sites’ [Papadopoulos et al., 2014]
‘the combination of novelty and
significance’ [Martin et al., 2015]
One common method to find novel (emerging or recent trending)
topics from a data stream is looking for bursts in frequent
occurrences of keywords and phrases
(n-grams) [Martin et al., 2015, Martin and Göker, 2014,
Fujiki et al., 2004, Cataldi et al., 2010, Aiello et al., 2013].
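The burst idea can be sketched as follows: count in how many messages each n-gram occurs in the current time slot and compare that with its average count over earlier slots. This is a minimal illustration, not the exact scoring used in the cited papers; the smoothed ratio, the function names and the toy tweets are all assumptions made for this example.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def document_frequency(docs, n=2):
    """Count in how many documents each n-gram occurs."""
    df = Counter()
    for doc in docs:
        df.update(set(ngrams(doc.lower().split(), n)))
    return df

def burst_scores(current_docs, previous_slots, n=2):
    """Smoothed ratio of current frequency to mean past frequency (illustrative)."""
    current = document_frequency(current_docs, n)
    history = [document_frequency(slot, n) for slot in previous_slots]
    scores = {}
    for gram, df_now in current.items():
        mean_past = sum(h[gram] for h in history) / len(history) if history else 0.0
        scores[gram] = (df_now + 1) / (mean_past + 1)
    return scores

# Toy data: two earlier time slots and the current one.
previous_slots = [
    ["good morning everyone", "traffic is bad today"],
    ["good morning world", "coffee time"],
]
current_docs = [
    "earthquake hits city",
    "strong earthquake hits coast",
    "good morning everyone",
]
scores = burst_scores(current_docs, previous_slots)
top = max(scores, key=scores.get)
print(top, scores[top])  # earthquake hits 3.0
```

A phrase that appears every day ("good morning") scores 1.0 here, while a phrase that is new and repeated in the current slot stands out.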
17. Requirements: Scalability
an important requirement when dealing with data streams
of high volume and velocity, e.g. Twitter
BNgram approach [Martin and Göker, 2014]: 2 minutes per
topic model for a 15-minute dataset of tweets
18. Methodology: Data acquisition
Twitter is the major source of news stream
data [Hu et al., 2012].
Only a few studies focus on data sources other than the Twitter
stream, e.g.
Wikipedia [Osborne et al., 2012, Steiner et al., 2013].
New: integration of other social media APIs and cross-media
retrieval, e.g.:
tweets → topics(events) → (youtube) → videos
19. Methodology: Topic modeling
Topic detection approaches often involve
topic clustering
topic ranking
topic labeling
[Petkos et al., 2014, Martin and Göker, 2014,
Van Canneyt et al., 2014, Martin et al., 2015, Ifrim et al., 2014,
Elbagoury et al., 2015].
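The three steps named above (clustering, ranking, labeling) can be sketched end to end on toy data. This is a deliberately simple stand-in, not the method of any cited paper: it assumes greedy single-pass clustering on Jaccard similarity, ranking by cluster size, and labeling by the most frequent terms. The stopword list, threshold and example tweets are invented for illustration.

```python
from collections import Counter

STOPWORDS = {"the", "a", "in", "of", "to", "is", "and", "on", "at", "for"}

def tokenize(text):
    """Lowercased set of non-stopword tokens."""
    return {w for w in text.lower().split() if w not in STOPWORDS}

def jaccard(a, b):
    """Jaccard similarity of two token sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

def cluster(tweets, threshold=0.3):
    """Greedy single pass: join the first cluster whose seed tweet is similar enough."""
    clusters = []  # each cluster is a list of token sets; element 0 is the seed
    for tweet in tweets:
        tokens = tokenize(tweet)
        for members in clusters:
            if jaccard(tokens, members[0]) >= threshold:
                members.append(tokens)
                break
        else:
            clusters.append([tokens])
    return clusters

def rank(clusters):
    """Larger clusters first (size as a crude proxy for importance)."""
    return sorted(clusters, key=len, reverse=True)

def label(members, k=2):
    """The k most frequent terms; ties are broken alphabetically."""
    counts = Counter(t for tokens in members for t in tokens)
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ordered[:k]]

tweets = [
    "earthquake hits the city centre",
    "earthquake hits city",
    "major earthquake in the city",
    "new phone launch today",
    "phone launch event today",
]
ranked = rank(cluster(tweets))
for members in ranked:
    print(len(members), label(members))
# 3 ['city', 'earthquake']
# 2 ['launch', 'phone']
```

Real systems replace each step with something stronger (e.g. burst-aware ranking instead of raw cluster size), but the division of labor is the same.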
20. Methodology: Event extraction
News is often centered around specific events (happenings),
which provide a natural way to group the news
stories [Wu et al., 2015].
There exist several on-line services that mine events from news
articles in different languages:
European Media Monitor (http://emm.newsbrief.eu) [Pouliquen et al., 2008];
GDELT project (http://www.gdeltproject.org/) [Leetaru and Schrodt, 2013];
Event Registry (http://eventregistry.org) [Leban et al., 2014, Rupnik et al., 2015].
A few approaches to extract open-domain events from tweets have been
proposed [Popescu et al., 2011, Ritter et al., 2012,
Katsios et al., 2015], but none of them supports cross-lingual
linking.
21. Methodology: First story detection
The task of first story detection (FSD) was proposed to identify
the first story about a certain event from a document
stream [Petrovic et al., 2012]. The state-of-the-art FSD
approaches use similarity metrics over documents, such as TF-IDF
vectors or Locality Sensitive Hashing (LSH)
[Petrovic et al., 2012, Phuvipadawat and Murata, 2010], to
determine if candidate documents are close to existing documents
or could constitute a new event.
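The decision rule described above can be sketched with TF-IDF vectors and cosine similarity: a document is flagged as a first story when no earlier document is similar enough. This is a minimal illustration under stated assumptions, not the cited systems: the IDF variant, the threshold of 0.3 and the toy stream are invented here, and the brute-force comparison against every earlier document is what LSH-based approaches are designed to avoid.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def first_stories(stream, threshold=0.3):
    """Return documents whose best match among earlier documents is below threshold."""
    df = Counter()  # document frequency of each term seen so far
    seen = []       # raw term counts of earlier documents
    firsts = []
    for doc in stream:
        tf = Counter(doc.lower().split())
        df.update(tf.keys())
        n = len(seen) + 1  # number of documents including the current one

        def weigh(counts):
            # TF-IDF with a smoothed IDF (an assumption of this sketch).
            return {t: c * math.log(1 + n / df[t]) for t, c in counts.items()}

        vec = weigh(tf)
        best = max((cosine(vec, weigh(old)) for old in seen), default=0.0)
        if best < threshold:
            firsts.append(doc)
        seen.append(tf)
    return firsts

stream = [
    "earthquake strikes the coast",
    "strong earthquake strikes coast",
    "election results announced tonight",
    "earthquake strikes coast again",
]
firsts = first_stories(stream)
print(firsts)
# ['earthquake strikes the coast', 'election results announced tonight']
```

The threshold trades missed events against duplicate alerts, so in practice it would be tuned on labeled data.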
22. Next steps and ideas
Project: InVID EU-Horizon 2016-2019
WP: Social Media Mining
Deadline: June 2016 (deliverable)
Agenda:
Data acquisition
Breaking news detection
Evaluation framework: Twitter Trends, [Ifrim et al., 2014]
[Martin et al., 2015]
Methodology: topic modeling, event extraction, (semantic and
cross-lingual) ontology-based integration (e.g. BabelNet)
Progress: social media APIs integration proposal
23. Bibliography I
Aiello, L., Petkos, G., Martin, C., Corney, D., Papadopoulos,
S., Skraba, R., Göker, A., Kompatsiaris, I., and Jaimes, A.
(2013).
Sensing Trending Topics in Twitter.
IEEE Transactions on Multimedia, 15(6):1268–1282.
Blei, D. M., Ng, A. Y., and Jordan, M. I. (2003).
Latent Dirichlet Allocation.
The Journal of Machine Learning Research, 3:993–1022.
Burnside, G., Milioris, D., and Jacquet, P. (2014).
One Day in Twitter: Topic Detection Via Joint Complexity.
Cataldi, M., Di Caro, L., and Schifanella, C. (2010).
Emerging Topic Detection on Twitter Based on Temporal and
Social Terms Evaluation.
MDMKDD, pages 4:1–4:10.
24. Bibliography II
Elbagoury, A., Ibrahim, R., Farahat, A., Kamel, M., and
Karray, F. (2015).
Exemplar-Based Topic Detection in Twitter Streams.
In International AAAI Conference on Web and Social Media.
Fujiki, T., Nanno, T., Suzuki, Y., and Okumura, M. (2004).
Identification of bursts in a document stream.
In International Workshop on Knowledge Discovery in Data
Streams, pages 55–64.
Herbst, A., Simons, A., Brocke, J. v., Müller, O., Debortoli, S.,
and Vakulenko, S. (2014).
Identifying and Characterizing Topics in Enterprise Content
Management: a Latent Semantic Analysis of Vendor Case
Studies.
In 22nd European Conference on Information Systems, ECIS.
25. Bibliography III
Hu, M., Liu, S., Wei, F., Wu, Y., Stasko, J., and Ma, K.-L.
(2012).
Breaking news on twitter.
In Conference on Human Factors in Computing Systems,
pages 2751–2754.
Ifrim, G., Shi, B., and Brigadir, I. (2014).
Event detection in twitter using aggressive filtering and
hierarchical tweet clustering.
In SNOW-DC@ WWW, pages 33–40.
Katsios, G., Vakulenko, S., Krithara, A., and Paliouras, G.
(2015).
Towards open domain event extraction from twitter: Revealing
entity relations.
In DeRiVE@ ESWC, pages 35–46.
26. Bibliography IV
Leban, G., Fortuna, B., Brank, J., and Grobelnik, M. (2014).
Cross-lingual detection of world events from news articles.
In Proceedings of the ISWC, pages 21–24.
Leetaru, K. and Schrodt, P. A. (2013).
GDELT: Global data on events, location, and tone, 1979–2012.
In ISA Annual Convention, volume 2, page 4.
Martin, C., Corney, D., and Göker, A. (2015).
Mining Newsworthy Topics from Social Media.
In Advances in Social Media Analysis, pages 21–43.
Martin, C. and Göker, A. (2014).
Real-time topic detection with bursty n-grams: RGU’s
submission to the 2014 SNOW challenge.
In SNOW-DC@ WWW.
27. Bibliography V
Osborne, M., Petrovic, S., McCreadie, R., Macdonald, C., and
Ounis, I. (2012).
Bieber no more: First story detection using Twitter and
Wikipedia.
In TAIA.
Papadopoulos, S., Corney, D., and Aiello, L. M. (2014).
SNOW 2014 data challenge: Assessing the performance of news
topic detection methods in social media.
In SNOW-DC@ WWW, pages 1–8.
Petkos, G., Papadopoulos, S., and Kompatsiaris, Y. (2014).
Two-level message clustering for topic detection in twitter.
In SNOW-DC@ WWW, pages 49–56.
28. Bibliography VI
Petrovic, S., Osborne, M., and Lavrenko, V. (2012).
Using paraphrases for improving first story detection in news
and Twitter.
In Conference of the North American Chapter of the
Association for Computational Linguistics, pages 338–346.
Phuvipadawat, S. and Murata, T. (2010).
Breaking News Detection and Tracking in Twitter.
In International Conference on Web Intelligence and Intelligent
Agent Technology (WI-IAT), volume 3, pages 120–123.
Popescu, A.-M., Pennacchiotti, M., and Paranjpe, D. (2011).
Extracting events and event descriptions from twitter.
In WWW, pages 105–106.
29. Bibliography VII
Pouliquen, B., Steinberger, R., and Deguernel, O. (2008).
Story tracking: linking similar news over time and across
languages.
In Proceedings of the workshop on Multi-source Multilingual
Information Extraction and Summarization, pages 49–56.
Reuter, N., Vakulenko, S., Brocke, J. v., Debortoli, S., and
Müller, O. (2014).
Identifying the Role of Information Systems in Achieving
Energy-Related Environmental Sustainability using Text
Mining.
In 22nd European Conference on Information Systems, ECIS.
Ritter, A., Etzioni, O., Clark, S., and others (2012).
Open domain event extraction from twitter.
In SIGKDD, pages 1104–1112.
30. Bibliography VIII
Rupnik, J., Muhic, A., Leban, G., Skraba, P., Fortuna, B., and
Grobelnik, M. (2015).
News Across Languages-Cross-Lingual Document Similarity
and Event Tracking.
arXiv preprint arXiv:1512.07046.
Steiner, T., van Hooland, S., and Summers, E. (2013).
MJ No More: Using Concurrent Wikipedia Edit Spikes with
Social Network Plausibility Checks for Breaking News
Detection.
In WWW, pages 791–794.
Vakulenko, S., Müller, O., and Brocke, J. v. (2014).
Enriching iTunes App Store Categories via Topic Modeling.
In Proceedings of the International Conference on Information
Systems ICIS.
31. Bibliography IX
Van Canneyt, S., Feys, M., Schockaert, S., Demeester, T.,
Develder, C., and Dhoedt, B. (2014).
Detecting newsworthy topics in Twitter.
In SNOW-DC@ WWW, pages 1–8.
Wu, Z., Chen, L., and Giles, C. L. (2015).
Storybase: Towards Building a Knowledge Base for News
Events.
In ACL, pages 133–138.